Context
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
Objective
Data Dictionary
Domain Information
mortgage - the amount the customer owes on a loan taken to purchase a house.
Key Questions that can be answered
The dataset contains 14 columns, including:
# verify
import sys
print(sys.executable, sys.version)
/Users/nipunshah/anaconda3/bin/python 3.11.4 (main, Jul 5 2023, 08:54:11) [Clang 14.0.6 ]
# Use this cell to install the libraries if absent (for entire project)
# command to install missingno (uncomment below if not installed)
# !conda install -y missingno
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For pretty-printing tables
import tabulate as tb
# ML - model building
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score, precision_score, f1_score, ConfusionMatrixDisplay, make_scorer
# Helpful for plotting the decision tree
from sklearn import tree
# Cross Validation
from sklearn.model_selection import GridSearchCV, cross_val_score
# Suppress warnings
import warnings
warnings.filterwarnings('ignore') # Ignores all warnings (optional)
# Global options and themes
# Set pandas display options for better readability
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.max_rows', 100) # Show 100 rows by default
# Seaborn theme for consistent plotting style
sns.set_theme(style="whitegrid", palette="muted", context="notebook") # You can change it to darkgrid, ticks, etc.
plt.rcParams["figure.figsize"] = (15, 5) # Set default figure size for plots
plt.rcParams["font.size"] = 14 # Set font size for readability
# restrict float display to 2 decimal places
pd.options.display.float_format = '{:.2f}'.format
# Helpers
def tb_describe(df_col):
"""
Helper function to display descriptive statistics in a nicely formatted table
Parameters:
df_col : pandas Series or DataFrame column
The column to generate descriptive statistics for
Returns:
None - prints formatted table
"""
stats = df_col.describe().to_frame().T
print(tb.tabulate(stats, headers='keys', tablefmt='simple', floatfmt='.2f'))
# Primitive Utils
def snake_to_pascal(snake_str, join_with=" "):
"""Convert snake_case to PascalCase (eg my_name -> MyName)
Args:
snake_str (str): string to convert
join_with (str): character to join the components, default is space
"""
components = snake_str.split("_")
return join_with.join(x.title() for x in components)
def format_pct(val):
"""Format a val as percentage i.e max 2 decimal value & adding % at the end"""
return f"{val:.1f}%"
def to_percentage(value):
"""value is expected to be a normalized float value in [0, 1]"""
return format_pct(value * 100)
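A quick sanity check of the helpers above (illustrative inputs):
# Quick sanity check of the formatting helpers (illustrative inputs)
print(snake_to_pascal('personal_loan'))  # -> 'Personal Loan'
print(to_percentage(0.0934))             # -> '9.3%'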
def draw_countplot(
df,
colName: str,
*,
label=None,
rot=0,
order=None,
sort=True,
palette=None,
showgrid=None,
):
"""
Draw a count plot with value labels and optional features
Parameters:
-----------
df : pandas DataFrame
The dataframe containing the data
colName : str
Column name to plot
label : str, optional
Custom x-label (defaults to formatted column name)
rot : int, optional
Rotation angle for x-axis labels
order : list, optional
Custom order for categories
sort : bool, optional
Whether to sort by count (only used if order is None)
palette : str or list, optional
Color palette for bars
showgrid : bool, optional
Whether to show grid lines
"""
# prep (meta) --
xlabel = label if label else snake_to_pascal(colName)
priority = None
if order is not None:
priority = order
elif sort:
# sort by count
priority = df[colName].value_counts().index
# plot (crux) --
ax = sns.countplot(data=df, x=colName, order=priority, palette=palette)
# display count above each bar
ax.bar_label(ax.containers[0])
# Calculate & mark percentages
total = len(df[colName])
for p in ax.patches:
freq = p.get_height()
percentage = to_percentage(freq / total)
ax.annotate(
percentage,
(p.get_x() + p.get_width() / 2.0, freq / 2.0),
ha="center",
va="center",
)
# aesthetics --
plt.title(f"Frequency of {xlabel}")
plt.xlabel(xlabel)
plt.ylabel("count")
plt.xticks(rotation=rot)
if showgrid:
plt.grid(True)
plt.show()
# list all files in current directory
!ls
Loan_Modelling.csv notebook1.ipynb notebook1backup.ipynb
# Load the dataset
df = pd.read_csv('Loan_Modelling.csv')
# backup of original df
df_original = df.copy()
# Peek the dataset
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Shape (Size)
df.shape
(5000, 14)
There are 5000 customers in the dataset, with 14 attributes per customer
# Data Types
df.dtypes
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
dtype: object
All columns hold numeric values
# Unique values in each column
df.nunique()
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
The following columns can be treated as categorical variables, since they have only a few unique values: CreditCard, Online, CD_Account, Securities_Account, Personal_Loan, Education, Family
Additionally, ID and ZIPCode columns can be excluded from analysis since they contain too many unique values and don't provide meaningful patterns.
# Columns Information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# Missing values
missing_values = df.isnull().sum().sum()
missing_values
0
There are no missing values in the dataset
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
Number of duplicate rows: 0
No duplicates either: all rows are unique (i.e., no customer is repeated in the records)
# Statistical summary
stats = df.describe(include='all').T
stats
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00 | 2500.50 | 1443.52 | 1.00 | 1250.75 | 2500.50 | 3750.25 | 5000.00 |
| Age | 5000.00 | 45.34 | 11.46 | 23.00 | 35.00 | 45.00 | 55.00 | 67.00 |
| Experience | 5000.00 | 20.10 | 11.47 | -3.00 | 10.00 | 20.00 | 30.00 | 43.00 |
| Income | 5000.00 | 73.77 | 46.03 | 8.00 | 39.00 | 64.00 | 98.00 | 224.00 |
| ZIPCode | 5000.00 | 93169.26 | 1759.46 | 90005.00 | 91911.00 | 93437.00 | 94608.00 | 96651.00 |
| Family | 5000.00 | 2.40 | 1.15 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| CCAvg | 5000.00 | 1.94 | 1.75 | 0.00 | 0.70 | 1.50 | 2.50 | 10.00 |
| Education | 5000.00 | 1.88 | 0.84 | 1.00 | 1.00 | 2.00 | 3.00 | 3.00 |
| Mortgage | 5000.00 | 56.50 | 101.71 | 0.00 | 0.00 | 0.00 | 101.00 | 635.00 |
| Personal_Loan | 5000.00 | 0.10 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Securities_Account | 5000.00 | 0.10 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| CD_Account | 5000.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Online | 5000.00 | 0.60 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| CreditCard | 5000.00 | 0.29 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
Observations 🔍
Total customers: 5000, each with 14 bank-related attributes
Numerical Variables:
Categorical/Binary Variables:
Other Notes:
⚡ Note: the minimum Experience value is negative, indicating a data issue!
PLOTS: Since decision trees are non-parametric and handle outliers well, we focus less on normality and transformations and more on variable importance and interactions
df.columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
uniques = df['Age'].nunique()
uniques
45
Age is a discrete numerical variable with a meaningful ordering and low cardinality
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot 1: Histogram with KDE
sns.histplot(data=df, x='Age', kde=True, ax=ax1)
ax1.set_title('Age Distribution with KDE')
ax1.set_xlabel('Age')
ax1.set_ylabel('Count')
# Plot 2: Box Plot of Age
sns.boxplot(data=df, y='Age', ax=ax2)
ax2.set_title('Age Distribution')
ax2.set_ylabel('Age')
# Adjust layout and display
plt.tight_layout()
plt.show()
tb_describe(df['Age'])
       count    mean    std    min    25%    50%    75%    max
---  -------  ------  -----  -----  -----  -----  -----  -----
Age  5000.00   45.34  11.46  23.00  35.00  45.00  55.00  67.00
print('Skewness of Age : ', df['Age'].skew())
print('Kurtosis of Age : ', df['Age'].kurt())
Skewness of Age :  -0.02934068151284029
Kurtosis of Age :  -1.1530672623735783
Skewness (-0.0293) → Nearly Symmetric
Kurtosis (-1.153) → Flat Distribution
Decision Tree will consider a wide range of values when splitting.
df['Experience'].nunique()
47
tb_describe(df['Experience'])
              count    mean    std    min    25%    50%    75%    max
----------  -------  ------  -----  -----  -----  -----  -----  -----
Experience  5000.00   20.10  11.47  -3.00  10.00  20.00  30.00  43.00
Negative values: this is unusual! Experience should not logically be negative. These might be due to data entry errors or some other issue
# check for negative values
negative_experience = df[df['Experience'] < 0]
print('Number of customers with -ve professional experience : ', negative_experience.shape[0])
Number of customers with -ve professional experience : 52
This is relatively small (52/5000 ≈ 1%)
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot 1: Histogram with KDE
sns.histplot(data=df, x='Experience', kde=True, ax=ax1)
ax1.set_title('Experience Distribution with KDE')
ax1.set_xlabel('Years of Experience')
ax1.set_ylabel('Count')
# Plot 2: Box Plot
sns.boxplot(data=df, y='Experience', ax=ax2)
ax2.set_title('Experience Distribution')
ax2.set_ylabel('Years of Experience')
plt.tight_layout()
plt.show()
print('Skewness of Experience : ', df['Experience'].skew())
print('Kurtosis of Experience : ', df['Experience'].kurt())
Skewness of Experience :  -0.026324688402384513
Kurtosis of Experience :  -1.12152278596998
Skewness: -0.03 (close to 0) suggests that the distribution of Experience is nearly symmetrical. There’s a slight negative skew, but it’s minimal.
Kurtosis: -1.12 indicates that the distribution is flatter than a normal distribution with fewer and less extreme outliers.
The symmetrical distribution is good for Decision Trees since the model doesn’t rely on assumptions about the data's distribution.
Flatness in the distribution suggests that no extreme outliers are present, which is helpful in avoiding unnecessary splits caused by extreme values.
tb_describe(df['Income'])
          count    mean    std    min    25%    50%    75%     max
------  -------  ------  -----  -----  -----  -----  -----  ------
Income  5000.00   73.77  46.03   8.00  39.00  64.00  98.00  224.00
print('Skewness of Income : ', df['Income'].skew())
print('Kurtosis of Income : ', df['Income'].kurt())
Skewness of Income :  0.8413386072610816
Kurtosis of Income :  -0.04424418973549038
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2)
# Plot 1: Histogram with KDE
sns.histplot(data=df, x='Income', kde=True, ax=ax1)
ax1.set_title('Income Distribution with KDE')
ax1.set_xlabel('Income (in thousands)')
ax1.set_ylabel('Count')
# Plot 2: Box Plot
sns.boxplot(y=df['Income'], ax=ax2)
ax2.set_title('Income Distribution (Box Plot)')
ax2.set_ylabel('Income (in thousands)')
plt.tight_layout()
plt.show()
Observation 🔍
Looking at our plot:
The decision tree might create splits at these valleys because:
log_income = np.log10(df['Income'] + 1)
print('Skewness of Log-Transformed Income : ', log_income.skew())
print('Kurtosis of Log-Transformed Income : ', log_income.kurt())
Skewness of Log-Transformed Income :  -0.42071189305490686
Kurtosis of Log-Transformed Income :  -0.309721731205979
Now the distribution is relatively more balanced
tb_describe(log_income)
          count    mean    std    min    25%    50%    75%    max
------  -------  ------  -----  -----  -----  -----  -----  -----
Income  5000.00    1.78   0.30   0.95   1.60   1.81   2.00   2.35
# Just for understanding & verification
l = 10**0.95
r = 10**2.35
print(l, r)
8.912509381337454 223.872113856834
This is close to the original range of 8 to 224
# Create log-transformed plot with reference lines
plt.figure(figsize=(12, 6))
# Main histogram with KDE
sns.histplot(data=df, x='Income', kde=True, log_scale=True)
# Add vertical reference lines with annotations
plt.axvline(x=100, color='r', linestyle='--', alpha=0.5)
plt.text(105, plt.ylim()[1]*0.9, '$100K (10^2)', rotation=0)
plt.axvline(x=31.6, color='g', linestyle='--', alpha=0.5)
plt.text(33, plt.ylim()[1]*0.8, '$31.6K (10^1.5)', rotation=0)
plt.axvline(x=15.85, color='b', linestyle='--', alpha=0.5)
plt.text(15.85, plt.ylim()[1]*0.7, '$15.85K (10^1.2)', rotation=0)
plt.title('Log-Transformed Income Distribution with Reference Points')
plt.xlabel('Income (log scale, in thousands)')
plt.ylabel('Count')
plt.show()
Advantages
Observation
(home address ZIP code, i.e., postal code)
# check for unique values
df['ZIPCode'].nunique()
467
ref: https://www.smarty.com/docs/zip-codes-101
The first 3 digits denote the Sectional Center Facility (SCF), a major mail processing center, and thus encode geography
Let's explore it
# Create a new column with first 3 digits of ZIP code
area_code = df['ZIPCode'].astype(str).str[:3]
# Count unique regions
n_regions = area_code.nunique()
print(f"Number of unique regions (3-digit ZIP): {n_regions}")
# Get distribution of regions
area_code.describe()
Number of unique regions (3-digit ZIP): 57
count     5000
unique      57
top        900
freq       375
Name: ZIPCode, dtype: object
The zip code starting with 900 is in California. It includes zip codes for cities like Los Angeles and Oakwood
Although reducing ZIP codes to 57 regions helps with cardinality, including them may not be viable for the personal loan campaign: the geographic information might not significantly influence loan acceptance, and could add complexity without substantial predictive value (assessing this needs domain knowledge)
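Before dropping it, one cheap way to eyeball geographic signal is to compare loan-acceptance rates across the 3-digit regions. This is an illustrative check only, not part of the modeling pipeline:
# Illustrative: loan acceptance rate by 3-digit ZIP region
region_rate = df.groupby(df['ZIPCode'].astype(str).str[:3])['Personal_Loan'].mean()
print(region_rate.sort_values(ascending=False).head())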
# If the next command gives no output, uncomment the cell after it to install the dependency
!pip list | grep sqlalchemy
sqlalchemy_mate 2.0.0.0
# uncomment if dependency is absent (to install it)
#!pip install sqlalchemy_mate==2.0.0
# If the next command gives no output, uncomment the cell after it to install the dependency
!pip list | grep uszipcode
uszipcode 1.0.1
# uncomment if dependency is absent (to install it)
#!pip install uszipcode==1.0.1
from uszipcode import SearchEngine
# Initialize search engine
search = SearchEngine()
# Get unique full ZIP codes
unique_zips = df['ZIPCode'].unique()
# Create lists to store states and cities
states = []
cities = []
# Look up each complete ZIP code
for zip_code in unique_zips:
zip_info = search.by_zipcode(str(zip_code))
if zip_info:
states.append(zip_info.state)
cities.append(zip_info.major_city)
# Get unique counts
unique_states = len(set([s for s in states if s])) # exclude None
unique_cities = len(set([c for c in cities if c])) # exclude None
unique_states
1
unique_cities
244
📌 Points:
Single state: all customers are from the same state
244 Cities:
ZIP code doesn't add significant predictive value
Observations 🔍
df['Family'].nunique()
4
Since there are only 4 distinct values, let's treat it as a categorical variable
# make it categorical
family_col = df['Family'].astype('category')
family_col.dtype
CategoricalDtype(categories=[1, 2, 3, 4], ordered=False)
family_col.describe()
count     5000
unique       4
top          1
freq      1472
Name: Family, dtype: int64
plt.figure(figsize=(15, 6))
draw_countplot(df, 'Family', label='Family Size')
family_counts = df['Family'].value_counts()
family_counts.plot.pie(autopct='%1.1f%%', figsize=(5, 5))
plt.ylabel('') # Remove y-label
plt.title('Family Size Distribution')
plt.show()
family_counts = df['Family'].value_counts()
family_counts
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64
The family size feature shows a fairly balanced distribution, with most customers having family sizes of 1, 2, or 4; the smallest group is family size 3
tb_describe(df['CCAvg'])
         count    mean    std    min    25%    50%    75%    max
-----  -------  ------  -----  -----  -----  -----  -----  -----
CCAvg  5000.00    1.94   1.75   0.00   0.70   1.50   2.50  10.00
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot 1: Histogram with KDE
sns.histplot(data=df, x='CCAvg', kde=True, ax=ax1)
ax1.set_title('Credit Card Spending Distribution with KDE')
ax1.set_xlabel('Average CC Spending (thousand dollars)')
ax1.set_ylabel('Count')
# Plot 2: Box Plot
sns.boxplot(y=df['CCAvg'], ax=ax2)
ax2.set_title('Credit Card Spending Distribution')
ax2.set_ylabel('Average CC Spending (thousand dollars)')
# Adjust layout and display
plt.tight_layout()
plt.show()
Multiple valleys can be spotted in the plot, making them candidate split points for the decision tree classifier
# Print skewness and kurtosis
print("\nSkewness:", df['CCAvg'].skew())
print("Kurtosis:", df['CCAvg'].kurt())
Skewness: 1.5984433366678663
Kurtosis: 2.646706374237909
Observations 🔍
The distribution is right-skewed with a heavy tail
plt.figure(figsize=(12, 6))
# Filter out zeros and plot with log scale
non_zero_ccavg = df[df['CCAvg'] > 0]['CCAvg']
sns.histplot(data=non_zero_ccavg, kde=True, log_scale=True)
plt.title('Log-Transformed CCAvg Distribution (Non-zero values)')
plt.xlabel('Average CC Spending (log scale, thousand dollars)')
plt.ylabel('Count')
# Add reference lines
plt.axvline(x=0.46, color='y', linestyle='--', alpha=0.5)
plt.text(0.46, plt.ylim()[1]*0.85, '$0.46K', rotation=0)
plt.axvline(x=1, color='r', linestyle='--', alpha=0.5)
plt.text(1.1, plt.ylim()[1]*0.9, '$1K', rotation=0)
plt.axvline(x=2, color='g', linestyle='--', alpha=0.5)
plt.text(2.1, plt.ylim()[1]*0.8, '$2K', rotation=0)
plt.axvline(x=5, color='b', linestyle='--', alpha=0.5)
plt.text(5.1, plt.ylim()[1]*0.7, '$5K', rotation=0)
plt.show()
# Print number of zero values
print(f"Number of customers with zero CC spending: {len(df) - len(non_zero_ccavg)}")
Number of customers with zero CC spending: 106
This is much better compared to the non-log-transformed view
Good for finding split points
Hence the classifier may focus on customers spending at least around $500 on their credit cards
transformed_ccavg = np.log10(df['CCAvg'] + 1)
print('Skewness of Log-Transformed CCAvg : ', transformed_ccavg.skew())
print('Kurtosis of Log-Transformed CCAvg : ', transformed_ccavg.kurt())
Skewness of Log-Transformed CCAvg :  0.31922379644464294
Kurtosis of Log-Transformed CCAvg :  -0.46767694208709853
Observation:
The classifier might focus on customers spending roughly more than $500 (i.e., > $460)
NOTE: There are a few gaps on the lower-spending side as well, which the tree can pick up as splitting criteria; the plot shows 3 clear gaps
Hence this can be one of the important features the decision tree considers for splitting; see the sketch below
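As a rough way to surface those gaps programmatically (a sketch; the 0.1 bin width is an arbitrary assumption):
# Empty histogram bins mark gaps that a tree could use as split thresholds
counts, edges = np.histogram(df['CCAvg'], bins=np.arange(0, 10.1, 0.1))
print(edges[:-1][counts == 0])  # left edges of empty bins, if any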
(categorical - ordinal)
df['Education'].nunique()
3
plt.figure(figsize=(15, 6))
draw_countplot(df, 'Education', label='Education Level')
education_counts = df['Education'].value_counts()
education_counts.plot.pie(autopct='%1.1f%%', figsize=(5, 5))
plt.ylabel('') # Remove y-label
plt.title('Education Level Distribution')
plt.show()
education_counts
1    2096
3    1501
2    1403
Name: Education, dtype: int64
Observations 🔍
(house loan amount, in thousands of dollars)
tb_describe(df['Mortgage'])
            count    mean     std    min    25%    50%     75%     max
--------  -------  ------  ------  -----  -----  -----  ------  ------
Mortgage  5000.00   56.50  101.71   0.00   0.00   0.00  101.00  635.00
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot 1: Histogram with KDE
sns.histplot(data=df, x='Mortgage', kde=True, ax=ax1)
ax1.set_title('Mortgage Distribution with KDE')
ax1.set_xlabel('Mortgage Amount (thousand dollars)')
ax1.set_ylabel('Count')
# Plot 2: Box Plot
sns.boxplot(y=df['Mortgage'], ax=ax2)
ax2.set_title('Mortgage Distribution')
ax2.set_ylabel('Mortgage Amount (thousand dollars)')
# Adjust layout and display
plt.tight_layout()
plt.show()
# skew and kurtosis
print('Skewness of Mortgage : ', df['Mortgage'].skew())
print('Kurtosis of Mortgage : ', df['Mortgage'].kurt())
Skewness of Mortgage :  2.1040023191079444
Kurtosis of Mortgage :  4.756796669311615
📌 Points
Many customers have no mortgage
There is a large gap and a valley at the same spot between the high and low peaks of the graph, indicating a natural separation in the data distribution
# log-transformed plot
plt.figure(figsize=(12, 6))
# Filter out zeros and plot with log scale
non_zero_mortgage = df[df['Mortgage'] > 0]['Mortgage']
sns.histplot(data=non_zero_mortgage, kde=True, log_scale=True)
plt.title('Log-Transformed Mortgage Distribution (Non-zero values)')
plt.xlabel('Mortgage Amount (log scale, thousand dollars)')
plt.ylabel('Count')
# Add reference lines
plt.axvline(x=50, color='r', linestyle='--', alpha=0.5)
plt.text(52, plt.ylim()[1]*0.9, '$50K', rotation=0)
plt.axvline(x=100, color='g', linestyle='--', alpha=0.5)
plt.text(105, plt.ylim()[1]*0.8, '$100K', rotation=0)
plt.axvline(x=200, color='b', linestyle='--', alpha=0.5)
plt.text(210, plt.ylim()[1]*0.7, '$200K', rotation=0)
plt.show()
# Print number of zero values
print(f"Number of customers with no mortgage: {len(df) - len(non_zero_mortgage)}")
Number of customers with no mortgage: 3462
transformed_mortgage = np.log10(df['Mortgage'] + 1)
# Analyze
skew = transformed_mortgage.skew()
kurt = transformed_mortgage.kurt()
print('Skewness of Log-Transformed Mortgage:', skew)
print('Kurtosis of Log-Transformed Mortgage:', kurt)
Skewness of Log-Transformed Mortgage: 0.8766882783607725
Kurtosis of Log-Transformed Mortgage: -1.1680156143543925
📌 Points
The log-transform plot effectively excludes customers with no mortgage (zeros were filtered out)
Zero-values consideration: since many mortgages are zero, a log transformation does not affect them directly, so an additional categorical flag (e.g., "Has Mortgage" vs. "No Mortgage") could be useful; see the sketch after this list
The first peak at 100K might represent a concentration of loans around this value (possibly the most common loan amount or a frequent range).
The second peak at 200K could suggest another cluster of loans, indicating that borrowers tend to group around these two values.
Decision trees might better capture patterns among customers with mortgages
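A minimal sketch of the flag idea mentioned above (illustrative; not applied to df, and the name Has_Mortgage is hypothetical):
# Hypothetical "Has_Mortgage" flag to complement the log transform,
# since log10(0 + 1) = 0 bunches all zero-mortgage customers at one point
mortgage_flag = (df['Mortgage'] > 0).astype(int)
print(mortgage_flag.value_counts())  # expect 3462 zeros and 1538 ones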
Observation 🔍
(Opted or not) | Binary
df['Securities_Account'].value_counts()
0    4478
1     522
Name: Securities_Account, dtype: int64
draw_countplot(df, 'Securities_Account', label='Securities Account Status')
📌 Points
Highly imbalanced (≈9:1 ratio)
Could be a strong discriminator if correlated with loan acceptance
Since Decision Trees can easily handle categorical data, this imbalance may cause the tree to heavily favor the 0 category. This could potentially reduce the model's performance for the minority class (1) unless handled appropriately.
Poor Recall for Minority Class
The recall for class 1 (has a securities account) could be low, meaning the model may fail to identify many of the customers who actually have a securities account.
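If recall on a minority class is the concern (for the model, ultimately recall on loan acceptors), a recall-focused scorer can drive the tuning later. A minimal sketch using the make_scorer already imported above:
# Scorer that optimizes recall on the positive (minority) class;
# pos_label=1 is forwarded to recall_score by make_scorer
recall_scorer = make_scorer(recall_score, pos_label=1)
# later: GridSearchCV(..., scoring=recall_scorer)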
(Opted or not) | Binary
df['CD_Account'].value_counts()
0    4698
1     302
Name: CD_Account, dtype: int64
draw_countplot(df, 'CD_Account', label='CD Account Status')
Observations 🔍
The decision tree algorithm might overly favor the 0 category (i.e., no CD account).
(Use or not) | Binary
df['Online'].value_counts()
1    2984
0    2016
Name: Online, dtype: int64
draw_countplot(df, 'Online', label='Online Banking Status')
Observations 🔍
Fairly balanced
draw_countplot(df, 'CreditCard', label='Credit Card by other bank')
df['CreditCard'].value_counts()
0    3530
1    1470
Name: CreditCard, dtype: int64
Observations 🔍
(Opted or not) | Binary
df['Personal_Loan'].value_counts()
0    4520
1     480
Name: Personal_Loan, dtype: int64
draw_countplot(df, 'Personal_Loan', label='Personal Loan Status')
📌 Points
Observations 🔍
This could lead to poor recall for class 1, as the tree may be biased toward predicting 0.
Decision trees are prone to overfitting, especially when there is class imbalance. The tree might create deep branches for class 0 (the majority) to get perfect training predictions, but struggle to generalize for class 1 (the minority).
In terms of accuracy, the model might seem to perform well simply because it predicts class 0 correctly most of the time.
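To make that concrete, a trivial majority-class baseline (a sketch on the full dataset):
# A classifier that always predicts 0 (no loan) already reaches ~90% accuracy,
# which is why accuracy alone is a misleading metric here
baseline_accuracy = (df['Personal_Loan'] == 0).mean()
print(to_percentage(baseline_accuracy))  # ~'90.4%'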
We need to tune the model (i.e., pre-pruning) to mitigate the effect of the imbalance in the target
Summary
The Personal Loan column is imbalanced, and this could impact Decision Tree performance, particularly in predicting the minority class. Balancing the data and tuning the model will help improve the prediction accuracy for the minority class.
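One standard mitigation, sketched below, is to weight classes inversely to their frequency (the pre-pruning helper defined later also exposes class_weight='balanced'):
from sklearn.utils.class_weight import compute_class_weight
# With 'balanced', weight = n_samples / (n_classes * class_count):
# roughly 0.55 for class 0 and 5.21 for class 1 here
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=df['Personal_Loan'])
print(dict(zip([0, 1], weights)))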
df.columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
plt.figure(figsize=(15, 6))
sns.scatterplot(data=df,
x='Age',
y='Income',
hue='Personal_Loan',
alpha=0.8,
) # Larger point size
plt.title('Age vs Income by Personal Loan Status')
plt.xlabel('Age (years)')
plt.ylabel('Income (thousand dollars)')
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=['Not Accepted', 'Accepted'], title='Personal Loan')
plt.show()
# Optional: Add summary statistics by loan status
df.groupby('Personal_Loan')['Age'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Personal_Loan | ||||||||
| 0 | 4520.00 | 45.37 | 11.45 | 23.00 | 35.00 | 45.00 | 55.00 | 67.00 |
| 1 | 480.00 | 45.07 | 11.59 | 26.00 | 35.00 | 45.00 | 55.00 | 65.00 |
df.groupby('Personal_Loan')['Income'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Personal_Loan | ||||||||
| 0 | 4520.00 | 66.24 | 40.58 | 8.00 | 35.00 | 59.00 | 84.00 | 224.00 |
| 1 | 480.00 | 144.75 | 31.58 | 60.00 | 122.00 | 142.50 | 172.00 | 203.00 |
Key observations from Age vs Income analysis:
Income patterns:
Key insight:
sns.violinplot(x=pd.qcut(df['Age'], q=4), y='Income', hue='Personal_Loan', data=df, split=True)
plt.title('Violin Plot of Income vs Age (Binned, with Personal Loan as Hue)')
plt.xlabel('Age (Binned into Quartiles)')
plt.ylabel('Income')
plt.xticks(rotation=45)
plt.show()
Observations 🔍
The decision tree can find more balanced split points for acceptors than for non-acceptors, because the acceptors' income distribution is comparatively symmetric
This reinforces that income is the dominant factor in loan acceptance, while age plays a minimal role in the decision-making process
sns.boxplot(data=df,
x='Family',
y='Income',
hue='Personal_Loan')
plt.title('Income Distribution by Family Size and Loan Status')
plt.xlabel('Family Size')
plt.ylabel('Income (thousand dollars)')
plt.show()
Observations 🔍:
Income is the key factor - loan acceptors consistently have higher incomes (140k-150k) compared to non-acceptors (60k-70k) across all family sizes
Family size doesn't play a major role in loan acceptance - the income patterns stay similar whether someone has a small or large family
There's a clear income divide around $100k - people above this income level are much more likely to accept loans
The income spread is wider for those who don't take loans, suggesting other factors may influence their decision besides just income level
df[['Income', 'CCAvg']].corr()
| Income | CCAvg | |
|---|---|---|
| Income | 1.00 | 0.65 |
| CCAvg | 0.65 | 1.00 |
It seems people with higher incomes tend to spend more on their credit cards
# Scatter plot of Income vs Credit Card Spending
sns.scatterplot(data=df,
x='Income',
y='CCAvg',
hue='Personal_Loan',
alpha=0.6)
plt.title('Income vs Credit Card Spending by Loan Status')
plt.xlabel('Income (thousand dollars)')
plt.ylabel('Average CC Spending (thousand dollars)')
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=['Not Accepted', 'Accepted'], title='Personal Loan')
plt.show()
Observations :
Strong positive correlation (0.65) between income and credit card spending - as income increases, people tend to spend more on credit cards
Loan acceptors (orange dots) cluster in the higher income range (>$100k) and also tend to have higher credit card spending, comparatively
Non-acceptors (blue dots) are more spread out but concentrated in lower income and credit card spending ranges
📌 Based on the above 3 analyses
Income appears to be a strong differentiator between loan approval and rejection, indicating that it could be a key feature for decision splits in predicting personal loan eligibility.
pd.crosstab(df['Education'], df['Personal_Loan'])
| Personal_Loan | 0 | 1 |
|---|---|---|
| Education | ||
| 1 | 2003 | 93 |
| 2 | 1221 | 182 |
| 3 | 1296 | 205 |
plt.figure(figsize=(15, 6))
ax = sns.countplot(data=df,
x='Education',
hue='Personal_Loan')
# Add count labels on top of each bar
for container in ax.containers:
ax.bar_label(container)
plt.title('Loan Acceptance by Education Level')
plt.xlabel('Education Level (1:Undergrad, 2:Graduate, 3:Advanced)')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['Not Accepted', 'Accepted'])
plt.show()
Observations 🔍
Key Insight for Decision Tree ⚡:
# get idea about skew of mortgage
df['Mortgage'].skew()
2.1040023191079444
plt.figure(figsize=(15, 6))
# Box plot to show mortgage distribution by loan status
sns.boxplot(data=df,
x='Personal_Loan',
y='Mortgage',
order=[0, 1])
plt.title('Mortgage Distribution by Loan Status')
plt.xlabel('Personal Loan (0: Not Accepted, 1: Accepted)')
plt.ylabel('Mortgage Amount (thousand dollars)')
plt.show()
plt.figure(figsize=(15, 6))
sns.violinplot(data=df,
x='Personal_Loan',
y='Mortgage',
order=[0, 1])
plt.title('Mortgage Distribution by Loan Status (Violin Plot)')
plt.xlabel('Personal Loan (0: Not Accepted, 1: Accepted)')
plt.ylabel('Mortgage Amount (thousand dollars)')
plt.show()
Heavy tail (ie many outliers)
# fraction of people who didn't take the loan but have mortgage > 0
df[(df['Personal_Loan'] == 0) & (df['Mortgage'] > 0)].shape[0] / df[df['Mortgage'] > 0].shape[0]
0.8907672301690507
# fraction of people who took the loan and have mortgage > 0
df[(df['Personal_Loan'] == 1) & (df['Mortgage'] > 0)].shape[0] / df[df['Mortgage'] > 0].shape[0]
0.10923276983094929
# fraction of people who didn't take the loan when mortgage > 500
df[(df['Personal_Loan'] == 0) & (df['Mortgage'] > 500)].shape[0] / df[df['Mortgage'] > 500].shape[0]
0.36
# fraction of people who took the loan when mortgage > 500
df[(df['Personal_Loan'] == 1) & (df['Mortgage'] > 500)].shape[0] / df[df['Mortgage'] > 500].shape[0]
0.64
# if a person accepts the loan, their mortgage stats
tb_describe(df[df['Personal_Loan'] == 1]['Mortgage'])
           count    mean     std    min    25%    50%     75%     max
--------  ------  ------  ------  -----  -----  -----  ------  ------
Mortgage  480.00  100.85  160.85   0.00   0.00   0.00  192.50  617.00
Observations:
Insights for Decision Tree:
This suggests having a mortgage may make customers less likely to take personal loans
# personal loan acceptance when possess securities account
df[df['Securities_Account'] == 1]['Personal_Loan'].value_counts()
0    462
1     60
Name: Personal_Loan, dtype: int64
# personal loan acceptance when not possess securities account
df[df['Securities_Account'] == 0]['Personal_Loan'].value_counts()
0 4058 1 420 Name: Personal_Loan, dtype: int64
# cross-tabulation
(pd.crosstab(df['Securities_Account'], df['Personal_Loan'], normalize='index') * 100).plot(
    kind='bar',
    stacked=True,
    rot=0,  # keep x-tick labels horizontal
    figsize=(15, 6)  # pass figsize here; a separate plt.figure() would create an unused empty figure
)
plt.title('Loan Acceptance Rate by Securities Account Status')
plt.xlabel('Has Securities Account')
plt.ylabel('Proportion')
plt.legend(title='Personal Loan', labels=['Not Accepted', 'Accepted'])
# Add percentage labels
for c in plt.gca().containers:
plt.bar_label(c, fmt='%.1f%%', label_type='center')
plt.show()
Observations:
# Create cross-tabulation with percentages
(pd.crosstab(df['Online'], df['Personal_Loan'], normalize='index') * 100).plot(
    kind='bar',
    stacked=True,
    rot=0,
    figsize=(10, 6)  # pass figsize here; a separate plt.figure() would create an unused empty figure
)
plt.title('Loan Acceptance Rate by Online Banking Status')
plt.xlabel('Uses Online Banking')
plt.ylabel('Proportion')
plt.legend(title='Personal Loan', labels=['Not Accepted', 'Accepted'])
# Add percentage labels
for c in plt.gca().containers:
plt.bar_label(c, fmt='%.1f%%', label_type='center')
plt.show()
# fraction of customer who use online banking and accepted loan
val = df[(df['Online'] == 1) & (df['Personal_Loan'] == 1)].shape[0] / df[df['Online'] == 1].shape[0]
to_percentage(val)
'9.8%'
# fraction of customer who dont use online banking and accepted loan
val = df[(df['Online'] == 0) & (df['Personal_Loan'] == 1)].shape[0] / df[df['Online'] == 0].shape[0]
to_percentage(val)
'9.4%'
Points 📌
Observations 🔍
# Scatter plot of CCAvg vs Mortgage colored by Personal Loan
plt.figure(figsize=(15, 6))
sns.scatterplot(data=df,
x='CCAvg',
y='Mortgage',
hue='Personal_Loan',
alpha=0.6)
plt.title('Credit Card Spending vs Mortgage by Loan Status')
plt.xlabel('Average CC Spending (thousand dollars)')
plt.ylabel('Mortgage Amount (thousand dollars)')
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=['Not Accepted', 'Accepted'], title='Personal Loan')
plt.show()
Observation :
df['Mortgage'].describe()
count   5000.00
mean      56.50
std      101.71
min        0.00
25%        0.00
50%        0.00
75%      101.00
max      635.00
Name: Mortgage, dtype: float64
# people who have a medium mortgage (100-500) and high credit card spending
df[(100 <= df['Mortgage']) & (df['Mortgage'] <= 500) & (df['CCAvg'] > 1)]['Personal_Loan'].value_counts()
0    659
1    123
Name: Personal_Loan, dtype: int64
Approximately 16% of them (123/782) accepted the personal loan while holding medium-sized house loans
# people who have high mortgage and high spending on credit card
df[ (df['Mortgage'] > 500) & (df['CCAvg'] > 1)]['Personal_Loan'].value_counts()
1    15
0     8
Name: Personal_Loan, dtype: int64
About 65% (15 of 23) of people with a high mortgage and high credit card spending chose to take the personal loan
# people who have low mortgage and low spending on credit card
df[(df['Mortgage'] < 100) & (df['CCAvg'] < 1)]['Personal_Loan'].value_counts()
0    1244
1      31
Name: Personal_Loan, dtype: int64
NOTE:
We focused on visualizations that directly aid in understanding potential decision splits, rather than using methods like HeatMaps which might not offer the same clarity for decision tree modeling.
Why?
If df was modified earlier during the analysis, we don't want those alterations carried into preprocessing, so we restart from the original copy
df = df_original.copy()
# verify that df is different ie copied of original and will not modify original by any means
id(df) == id(df_original)
False
Earlier we saw that Experience has negative values
# people with -ve experience
neg_exp_df = df[df['Experience'] < 0]
neg_exp_df.shape[0]
52
There are 52 people with negative experience
Idea & Explore 🧠:
Generally, people gain experience as they age
If Experience was mistakenly recorded with a negative sign (for example, because of a data entry error), we would expect Age and Experience to show either a strong negative correlation (the older the person, the less the recorded experience) or a weak positive one (the true positive relation weakened by the flipped signs and the small number of such observations).
In an ideal scenario, Experience and Age should be positively correlated: as a person's age increases, their work experience typically increases. A strongly negative or near-zero correlation would suggest something went wrong in data entry (i.e., the negative sign was added by mistake).
# Overall correlations
print("Overall correlations:")
print(df[['Age', 'Experience', 'Income']].corr())
# Correlations for positive experience cases
print("\nCorrelations for positive experience cases:")
print(df[df['Experience'] >= 0][['Age', 'Experience', 'Income']].corr())
# Correlations for negative experience cases
print("\nCorrelations for negative experience cases:")
print(df[df['Experience'] < 0][['Age', 'Experience', 'Income']].corr())
Overall correlations:
Age Experience Income
Age 1.00 0.99 -0.06
Experience 0.99 1.00 -0.05
Income -0.06 -0.05 1.00
Correlations for positive experience cases:
Age Experience Income
Age 1.00 0.99 -0.06
Experience 0.99 1.00 -0.05
Income -0.06 -0.05 1.00
Correlations for negative experience cases:
Age Experience Income
Age 1.00 0.31 -0.07
Experience 0.31 1.00 -0.14
Income -0.07 -0.14 1.00
# Create scatter plots
plt.figure(figsize=(15, 5))
# Plot 1: Age vs Experience for negative cases
plt.subplot(1, 3, 1)
plt.scatter(neg_exp_df['Age'], neg_exp_df['Experience'])
plt.xlabel('Age')
plt.ylabel('Experience')
plt.title('Age vs Experience\n(Negative Experience Cases)')
# Plot 2: Income vs Experience for negative cases
plt.subplot(1, 3, 2)
plt.scatter(neg_exp_df['Income'], neg_exp_df['Experience'])
plt.xlabel('Income')
plt.ylabel('Experience')
plt.title('Income vs Experience\n(Negative Experience Cases)')
# Plot 3: Age vs Income for negative cases
plt.subplot(1, 3, 3)
plt.scatter(neg_exp_df['Age'], neg_exp_df['Income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income\n(Negative Experience Cases)')
plt.tight_layout()
plt.show()
# Print summary statistics for negative experience cases
print("\nSummary of negative experience cases:")
print(neg_exp_df[['Age', 'Experience', 'Income']].describe())
Summary of negative experience cases:
Age Experience Income
count 52.00 52.00 52.00
mean 24.52 -1.44 69.94
std 1.48 0.64 37.96
min 23.00 -3.00 12.00
25% 24.00 -2.00 40.75
50% 24.00 -1.00 65.50
75% 25.00 -1.00 86.75
max 29.00 -1.00 150.00
As the negative Experience values range from -3 to -1 and these customers appear to be entry-level, the two facts go hand in hand and suggest that the negative sign was added by mistake: entry level is generally associated with 1 to 3 years of experience
Moreover, the income of individuals with negative experience is in line with entry-level salaries (roughly 40K to 80K or higher), consistent with people who have 1-2 years of experience (not negative experience)
It is therefore reasonable to assume that the negative sign was a data entry mistake, where a user or system entered -1 instead of 1 (and similarly for the other negative values)
Hence, by applying abs(), we are essentially correcting the data entry error
# Take absolute value of Experience
df['Experience'] = df['Experience'].abs()
# Verify the change
tb_describe(df['Experience'])
              count    mean    std    min    25%    50%    75%    max
----------  -------  ------  -----  -----  -----  -----  -----  -----
Experience  5000.00   20.13  11.42   0.00  10.00  20.00  30.00  43.00
(df['Experience'] < 0).sum()
0
Thus no negative experience values remain
Why is ZIP Code discarded?
No Direct Influence on Loan Acceptance – ZIP codes represent geographic locations, but they don’t directly impact a customer's financial behavior or likelihood of accepting a loan.
Too Granular & High Cardinality – Since ZIP codes are categorical but unique to regions, they create too many categories with little predictive power.
Privacy & Ethical Concerns – Using ZIP codes for loan predictions may introduce bias based on location demographics rather than individual financial behavior.
Not Useful for Marketing – The bank’s goal is to predict who will take a loan, not where they live. Location-based targeting might be relevant in some cases, but it's unlikely to be the key driver in personal loan acceptance.
Why is ID discarded?
ID is just a unique identifier (i.e., a nominal value) with no predictive value
# Drop ID and ZIP code columns
df = df.drop(['ID', 'ZIPCode'], axis=1)
Education
There is a natural hierarchy/progression in education levels
Therefore, let's treat Education as ordinal data
The natural ordering could be informative for predicting loan acceptance
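For readability, the ordering can be made explicit; this is illustrative only, since the integer codes 1 < 2 < 3 already encode it for the tree:
# Explicit ordered categorical for Education (not required for the model)
education_ordered = pd.Categorical(df['Education'], categories=[1, 2, 3], ordered=True)
print(education_ordered.dtype)  # category, ordered: [1 < 2 < 3]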
Feature Engineering
Should we do OHE for { Securities_Account, CD_Account, Online, CreditCard } as all of them are categorical in nature ?
-> Theoretically yes, but it is not required here, as all of them are binary (i.e., either 0 or 1)
Why?
For binary variables, there's no risk of ordinal interpretation since there are only two values. The classifier will effectively treat them as two distinct categories regardless of whether they're 0/1 or one-hot encoded.
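A quick illustration of why:
# One-hot encoding a 0/1 column just yields the column and its complement,
# so it adds no information for a tree split
demo = pd.get_dummies(df['Online'], prefix='Online')
print(demo.head())  # Online_0 == 1 - Online_1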
Outlier
# Check outliers in numerical columns using box-plot (IQR) statistics
numerical_cols = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# Calculate outlier statistics using IQR method
outlier_stats = {}
for col in numerical_cols:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
outlier_stats[col] = {
'outlier_count': len(outliers),
'outlier_percentage': (len(outliers) / len(df)) * 100,
'min': df[col].min(),
'max': df[col].max()
}
# Display outlier statistics
for col, stats in outlier_stats.items():
print(f"\n{col}:")
print(f"Number of outliers: {stats['outlier_count']}")
print(f"Percentage of outliers: {stats['outlier_percentage']:.2f}%")
print(f"Range: {stats['min']} to {stats['max']}")
Age:
Number of outliers: 0
Percentage of outliers: 0.00%
Range: 23 to 67

Experience:
Number of outliers: 0
Percentage of outliers: 0.00%
Range: 0 to 43

Income:
Number of outliers: 96
Percentage of outliers: 1.92%
Range: 8 to 224

CCAvg:
Number of outliers: 324
Percentage of outliers: 6.48%
Range: 0.0 to 10.0

Mortgage:
Number of outliers: 291
Percentage of outliers: 5.82%
Range: 0 to 635
df_original.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00 | 2500.50 | 1443.52 | 1.00 | 1250.75 | 2500.50 | 3750.25 | 5000.00 |
| Age | 5000.00 | 45.34 | 11.46 | 23.00 | 35.00 | 45.00 | 55.00 | 67.00 |
| Experience | 5000.00 | 20.10 | 11.47 | -3.00 | 10.00 | 20.00 | 30.00 | 43.00 |
| Income | 5000.00 | 73.77 | 46.03 | 8.00 | 39.00 | 64.00 | 98.00 | 224.00 |
| ZIPCode | 5000.00 | 93169.26 | 1759.46 | 90005.00 | 91911.00 | 93437.00 | 94608.00 | 96651.00 |
| Family | 5000.00 | 2.40 | 1.15 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| CCAvg | 5000.00 | 1.94 | 1.75 | 0.00 | 0.70 | 1.50 | 2.50 | 10.00 |
| Education | 5000.00 | 1.88 | 0.84 | 1.00 | 1.00 | 2.00 | 3.00 | 3.00 |
| Mortgage | 5000.00 | 56.50 | 101.71 | 0.00 | 0.00 | 0.00 | 101.00 | 635.00 |
| Personal_Loan | 5000.00 | 0.10 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Securities_Account | 5000.00 | 0.10 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| CD_Account | 5000.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| Online | 5000.00 | 0.60 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| CreditCard | 5000.00 | 0.29 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
⚡ Key Findings
Age (23-67 years):
Experience (-3 to 43 years):
Income (8K-224K):
CCAvg (0-10K monthly):
Mortgage (0-635K):
Summary: Only negative experience values need treatment. Other outliers are valid data points.
Decision trees naturally handle outliers through their splitting mechanism. Unlike algorithms such as linear regression or k-means clustering, trees are not distorted by extreme values; the tree simply creates splits that isolate outlier values when needed
Hence we can save processing time and skip this step:
df_processed = df.copy()
SEED = 42
# a function to compute different metrics to check performance of a classification model built using sklearn
def get_classification_metrics(y, y_pred):
"""
Function to compute different metrics to check classification model performance
Parameters:
y: dependent variable/ground truth labels
y_pred: predicted target values
Returns:
pd.DataFrame containing model performance metrics (Accuracy, Recall, Precision, F1)
"""
# compute various metrics
acc = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
precision = precision_score(y, y_pred)
f1 = f1_score(y, y_pred)
# the 2 lines below are intentionally commented out; we focus on the 4 main metrics once the model is built
#f1_weighted = f1_score(y, y_pred, average='weighted')
#recall_weighted = recall_score(y, y_pred, average='weighted')
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
#"F1_Weighted": f1_weighted,
#"Recall_Weighted": recall_weighted
},
index=[0]
)
return df_perf
def plot_confusion_matrix(y, y_pred):
"""
To plot the confusion_matrix with percentages
y: true target values
y_pred: predicted target values
"""
# Compute the confusion matrix comparing the true target values with the predicted values
cm = confusion_matrix(y, y_pred)
# Create labels for each cell in the confusion matrix with both count and percentage
total = cm.flatten().sum()
labels = np.asarray(
[
[f"{item:0.0f}\n{item/total:.2%}"]
for item in cm.flatten()
]
).reshape(2, 2) # reshaping to a matrix
# Set the figure size for the plot
plt.figure(figsize=(6, 4))
# Plot the confusion matrix as a heatmap with the labels
sns.heatmap(cm, annot=labels, fmt="", cmap="RdPu")
# Add a label to the y-axis
plt.ylabel("True label")
# Add a label to the x-axis
plt.xlabel("Predicted label")
def get_tree_stats(tree):
"""Helper function to get tree complexity statistics
Args:
tree: Fitted decision tree model
Returns:
None - prints tree statistics
"""
print("Tree Statistics:")
print(f"Number of nodes: {tree.tree_.node_count}")
print(f"Tree depth: {tree.get_depth()}")
def get_feature_importances(model, X):
"""
Get feature importances from a trained model.
Args:
model: Trained model with feature_importances_ attribute
X: Features dataframe used for training
Returns:
DataFrame with feature names and their importance scores, sorted by importance
"""
importances = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
return importances
def plot_decision_tree(model, X, figsize=(20,20), fontsize=9):
"""Helper function to plot a decision tree with formatted arrows
Args:
model: Fitted decision tree model
X: Features dataframe used for training
figsize: Figure size tuple (width, height)
fontsize: Font size for node text
Returns:
None - displays the plot
"""
# get feature names
feature_names = list(X.columns)
# create figure
plt.figure(figsize=figsize)
# plot decision tree
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=fontsize,
node_ids=False,
class_names=None
)
# format arrows
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
def print_tree_rules(model, feature_names):
"""Helper function to print a text report showing the rules of a decision tree
Args:
model: Fitted decision tree model
feature_names: List of feature names used in the model
Returns:
None - prints the tree rules
"""
print(
tree.export_text(
model,
feature_names=feature_names,
show_weights=True
)
)
def plot_feature_importance(importances, figsize=(10, 6), top_n=None):
"""
Plot feature importances as a horizontal bar chart.
Parameters:
importances: DataFrame with 'feature' and 'importance' columns
figsize: tuple, size of the figure (width, height)
top_n: int, optional number of top features to show (shows all if None)
"""
# Create copy to avoid modifying original
df = importances.copy()
# Limit to top N features if specified
if top_n is not None:
df = df.head(top_n)
# Sort by importance
df = df.sort_values('importance', ascending=True)
# Create plot
plt.figure(figsize=figsize)
# Create horizontal bar plot
bars = plt.barh(df['feature'], df['importance'])
# Add value labels on the bars
for bar in bars:
width = bar.get_width()
plt.text(width, bar.get_y() + bar.get_height()/2,
f'{width:.3f}',
ha='left', va='center', fontsize=10)
# Customize plot
plt.title('Feature Importance')
plt.xlabel('Importance Score')
# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()
def get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer=None, should_balance_target=False):
"""
Implement pre-pruning on decision tree using GridSearchCV
Returns the best model and results visualization
"""
# Define the parameter grid
param_grid = {
"max_depth": [5, 8, 10, 12, 15],
"min_samples_split": [5, 10, 15, 20],
"min_samples_leaf": [2, 4, 6, 8],
"max_leaf_nodes": [50, 75, 100, 125],
}
# Create base model
dt = DecisionTreeClassifier(
random_state=SEED,
class_weight="balanced" if should_balance_target else None,
)
# Implement GridSearchCV
grid_search = GridSearchCV(
estimator=dt,
param_grid=param_grid,
cv=5,
scoring=scorer,
n_jobs=-1,
verbose=1,
# keep train scores so we can compare training vs validation performance after the search
return_train_score=True
)
print("Building model ...")
# Fit the model
grid_search.fit(X_train, y_train)
# print model build using scoring strategy
print(f"Model built using scoring strategy: {grid_search.scoring}")
# Get best model
best_model = grid_search.best_estimator_
# Get all results
cv_results = pd.DataFrame(grid_search.cv_results_)
# Calculate feature importance
feature_importance = pd.DataFrame(
{
"feature": (
X_train.columns
if hasattr(X_train, "columns")
else [f"Feature_{i}" for i in range(X_train.shape[1])]
),
"importance": best_model.feature_importances_,
}
)
feature_importance = feature_importance.sort_values("importance", ascending=False)
return {
"best_model": best_model,
"best_params": grid_search.best_params_,
"best_score": grid_search.best_score_,
"train_score": best_model.score(X_train, y_train),
"test_score": best_model.score(X_test, y_test),
"cv_results": cv_results,
"feature_importance": feature_importance,
}
def plot_preprunning_results(results):
"""
Plot the results of pre-pruning analysis ie using `feature_importance` & `cv_results`
"""
# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Plot feature importance
sns.barplot(
data=results["feature_importance"].head(10), x="importance", y="feature", ax=ax1
)
ax1.set_title("Top 10 Feature Importance")
ax1.set_xlabel("Importance")
ax1.set_ylabel("Features")
# Plot training vs validation scores
cv_results = results["cv_results"]
ax2.scatter(
cv_results["mean_test_score"], cv_results["mean_train_score"], alpha=0.5
)
ax2.plot([0, 1], [0, 1], "--k")
ax2.set_xlabel("Validation Score")
ax2.set_ylabel("Training Score")
ax2.set_title("Training vs Validation Scores")
plt.tight_layout()
df_processed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   int64
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   int64
 6   Mortgage            5000 non-null   int64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   int64
 9   CD_Account          5000 non-null   int64
 10  Online              5000 non-null   int64
 11  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(11)
memory usage: 468.9 KB
X = df_processed.drop('Personal_Loan', axis=1)
y = df_processed['Personal_Loan']
# perform train test split (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Data
# total samples in train data
print(f"Total samples in train data: {X_train.shape[0]} = (80%)")
Total samples in train data: 4000 = (80%)
# Test Data
print(f"Total samples in test data: {X_test.shape[0]} = (20%)")
Total samples in test data: 1000 = (20%)
# target class distribution
print("Target class distribution in train data:")
print(y_train.value_counts())
print(y_train.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
print("\nTarget class distribution in test data:")
print(y_test.value_counts())
print(y_test.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Target class distribution in train data:
0    3625
1     375
Name: Personal_Loan, dtype: int64
0    90.62%
1     9.38%
Name: Personal_Loan, dtype: object

Target class distribution in test data:
0    895
1    105
Name: Personal_Loan, dtype: int64
0    89.5%
1    10.5%
Name: Personal_Loan, dtype: object
The distribution of the target class (i.e., personal loan acceptance) is roughly 90:10.
So, for every person who accepts the loan (1), there are about 9 people who do not accept it (0)
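A possible refinement (a sketch, not applied above): stratifying the split preserves the 90:10 ratio exactly in both sets, avoiding the small drift seen here (9.38% vs 10.5%):
# Stratified split keeps the class ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
print(y_tr.mean().round(3), y_te.mean().round(3))  # both ~0.096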
NOTE: We will do the following for each modelling experiment:
# classifier
dt_default = DecisionTreeClassifier(random_state=SEED)
# fit the model
dt_default.fit(X_train, y_train)
DecisionTreeClassifier(random_state=42)
dt_default.get_params()
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 42,
'splitter': 'best'}
# predict on train data
y_train_pred = dt_default.predict(X_train)
# predict on test data
y_test_pred = dt_default.predict(X_test)
Training Evaluation
# training performance metrics
dt_default_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_default_train_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# training confusion matrix
plot_confusion_matrix(y_train, y_train_pred)
Test Evaluation
# testing performance metrics
dt_default_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_default_test_metrics
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.93 | 0.95 | 0.94 |
The model seems to overfit, as the training results are perfect (all metrics are 1.00)
# testing confusion matrix
plot_confusion_matrix(y_test, y_test_pred)
# evaluate the model
print(classification_report(y_test, y_test_pred))
precision recall f1-score support
0 0.99 0.99 0.99 895
1 0.95 0.93 0.94 105
accuracy 0.99 1000
macro avg 0.97 0.96 0.97 1000
weighted avg 0.99 0.99 0.99 1000
# Check tree complexity
get_tree_stats(dt_default)
Tree Statistics:
Number of nodes: 125
Tree depth: 13
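Before the full grid search, a quick sanity check (illustrative) that capping depth trades training fit for generalization:
# A depth-capped tree as a pre-pruning sanity check
dt_shallow = DecisionTreeClassifier(max_depth=5, random_state=SEED)
dt_shallow.fit(X_train, y_train)
print('train accuracy:', dt_shallow.score(X_train, y_train))
print('test accuracy :', dt_shallow.score(X_test, y_test))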
X_train.columns
Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Get and display feature importances
importances_default = get_feature_importances(dt_default, X_train)
print("\nFeature Importances:")
print(importances_default)
Feature Importances:
feature importance
5 Education 0.38
2 Income 0.30
3 Family 0.18
4 CCAvg 0.05
1 Experience 0.03
0 Age 0.02
9 Online 0.02
8 CD_Account 0.01
6 Mortgage 0.01
10 CreditCard 0.00
7 Securities_Account 0.00
plot_decision_tree(dt_default, X_train)
print_tree_rules(dt_default, X_train.columns)
|--- Income <= 113.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2892.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Education <= 1.50 | | | | |--- Family <= 3.50 | | | | | |--- weights: [36.00, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- Education > 1.50 | | | | |--- CCAvg <= 1.65 | | | | | |--- Online <= 0.50 | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- Online > 0.50 | | | | | | |--- Experience <= 13.00 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- Experience > 13.00 | | | | | | | |--- Mortgage <= 206.00 | | | | | | | | |--- Education <= 2.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Education > 2.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | |--- Mortgage > 206.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- CCAvg > 1.65 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Income <= 108.50 | | | | | | | |--- Experience <= 25.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- Experience > 25.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Income > 108.50 | | | | | | | |--- CCAvg <= 1.75 | | | | | | | | |--- Age <= 48.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Age > 48.00 | | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.75 | | | | | | | | |--- weights: [17.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- Family <= 1.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- Family > 1.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 82.50 | | | | |--- Age <= 28.00 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 28.00 | | | | | |--- Experience <= 8.50 | | | | | | |--- Experience <= 6.50 | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | |--- Experience > 6.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Experience > 8.50 | | | | | | |--- Income <= 81.50 | | | | | | | |--- Mortgage <= 216.50 | | | | | | | | |--- Experience <= 18.50 | | | | | | | | | |--- Age <= 43.50 | | | | | | | | | | |--- weights: [23.00, 0.00] class: 0 | | | | | | | | | |--- Age > 43.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Experience > 18.50 | | | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | | |--- Mortgage > 216.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Income > 81.50 | | | | | | | |--- Age <= 52.00 | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | | |--- Age > 52.00 | | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- CCAvg > 3.75 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- Income > 82.50 | | | | |--- Family <= 2.50 | | | | | |--- Experience <= 33.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- Income <= 83.50 | | | | | | | | |--- Education <= 2.00 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- Education > 2.00 | | | | | | | | 
| |--- weights: [4.00, 0.00] class: 0 | | | | | | | |--- Income > 83.50 | | | | | | | | |--- Education <= 1.50 | | | | | | | | | |--- weights: [42.00, 0.00] class: 0 | | | | | | | | |--- Education > 1.50 | | | | | | | | | |--- Income <= 109.50 | | | | | | | | | | |--- Age <= 31.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- Age > 31.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- Income > 109.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Experience > 33.50 | | | | | | |--- Education <= 1.50 | | | | | | | |--- CCAvg <= 3.60 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- CCAvg > 3.60 | | | | | | | | |--- Income <= 97.00 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- Income > 97.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Education > 1.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- Family > 2.50 | | | | | |--- Age <= 57.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- Income <= 89.00 | | | | | | | | |--- Age <= 28.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Age > 28.00 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- Income > 89.00 | | | | | | | | |--- Age <= 49.50 | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | |--- Age > 49.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- Age > 57.00 | | | | | | |--- Mortgage <= 284.00 | | | | | | | |--- CCAvg <= 3.20 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- CCAvg > 3.20 | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | |--- Mortgage > 284.00 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 3.85 | | | | |--- weights: [0.00, 8.00] class: 1 | | | |--- CCAvg > 3.85 | | | | |--- Mortgage <= 81.00 | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | |--- Mortgage > 81.00 | | | | | |--- CreditCard <= 0.50 | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | |--- CreditCard > 0.50 | | | | | | |--- Income <= 93.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- Income > 93.50 | | | | | | | |--- Income <= 105.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- Income > 105.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 |--- Income > 113.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [463.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 59.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- Experience <= 32.00 | | | | |--- CCAvg <= 2.80 | | | | | |--- Experience <= 18.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- Experience > 18.00 | | | | | | |--- Education <= 2.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Education > 2.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- CCAvg > 2.80 | | | | | |--- weights: [0.00, 6.00] class: 1 | | | |--- Experience > 32.00 | | | | |--- weights: [6.00, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 237.00] class: 1
Results are quite surprising for the default model.
Let's check how imbalanced the target variable is.
# Check target variable imbalance
df_original['Personal_Loan'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
0    90.4%
1     9.6%
Name: Personal_Loan, dtype: object
NOTE 📌
Data is highly imbalanced (90% vs 10%), which makes the high metrics a bit suspicious.
With such imbalanced data, a high Accuracy (99%) can be misleading: a model that always predicted the majority class would already score around 90%.
The high Recall (93%), Precision (95%), and F1 (94%) suggest the model is learning real signal for the minority class, but on imbalanced data they deserve a closer look rather than being taken at face value.
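To see why accuracy alone is a weak yardstick here, a quick baseline check (illustrative; it reuses y_test and the accuracy_score import from the setup):
# Illustrative baseline: always predict the majority class ("no loan")
naive_pred = np.zeros_like(y_test)
print(accuracy_score(y_test, naive_pred))  # ≈ 0.90 for this 90/10 split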
So let's try class weighting.
# tree with balanced class weights
# classifier
dt_balanced = DecisionTreeClassifier(
random_state=SEED,
class_weight='balanced' # Simply use 'balanced' instead of explicit weights
)
# train the model
dt_balanced.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=42)
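For reference, class_weight='balanced' assigns each class the weight n_samples / (n_classes * class_count); a quick illustrative check of the weights it implies on our training split (values are approximate):
# Illustrative: reproduce the weights implied by class_weight='balanced'
counts = np.bincount(y_train)  # samples per class
print(dict(enumerate(len(y_train) / (2 * counts))))  # roughly {0: 0.55, 1: 5.2}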
dt_balanced.get_params()
{'ccp_alpha': 0.0,
'class_weight': 'balanced',
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': 42,
'splitter': 'best'}
# predict on train data
y_train_pred = dt_balanced.predict(X_train)
# predict on test data
y_test_pred = dt_balanced.predict(X_test)
Training Evaluation
# get performance metrics
dt_balanced_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_balanced_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# get confusion matrix
plot_confusion_matrix(y_train, y_train_pred)
Testing Evaluation
# get performance metrics
dt_balanced_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_balanced_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.92 | 0.96 | 0.94 |
This also seems overfitted: perfect training scores with a drop on the test set.
# get confusion matrix
plot_confusion_matrix(y_test, y_test_pred)
# evaluate the model
print(classification_report(y_test, y_test_pred))
precision recall f1-score support
0 0.99 1.00 0.99 895
1 0.96 0.92 0.94 105
accuracy 0.99 1000
macro avg 0.98 0.96 0.97 1000
weighted avg 0.99 0.99 0.99 1000
The model is still doing quite well ⚡
# Check tree complexity
get_tree_stats(dt_balanced)
Tree Statistics:
Number of nodes: 175
Tree depth: 18
X_train.columns
Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Get and display feature importances
importances_balanced = get_feature_importances(dt_balanced, X_train)
print("\nFeature Importances:")
print(importances_balanced)
Feature Importances:
feature importance
2 Income 0.61
3 Family 0.15
4 CCAvg 0.09
5 Education 0.09
1 Experience 0.02
0 Age 0.02
8 CD_Account 0.01
10 CreditCard 0.00
6 Mortgage 0.00
9 Online 0.00
7 Securities_Account 0.00
plot_decision_tree(dt_balanced, X_train)
print_tree_rules(dt_balanced, X_train.columns)
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [1522.76, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CCAvg <= 4.20 | | | |--- Income <= 81.50 | | | | |--- Experience <= 8.50 | | | | | |--- Family <= 3.50 | | | | | | |--- weights: [0.00, 26.67] class: 1 | | | | | |--- Family > 3.50 | | | | | | |--- weights: [3.86, 0.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- Experience <= 18.50 | | | | | | |--- Age <= 42.50 | | | | | | | |--- weights: [12.14, 0.00] class: 0 | | | | | | |--- Age > 42.50 | | | | | | | |--- CCAvg <= 3.60 | | | | | | | | |--- weights: [0.00, 10.67] class: 1 | | | | | | | |--- CCAvg > 3.60 | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | |--- Experience > 18.50 | | | | | | |--- weights: [19.86, 0.00] class: 0 | | | |--- Income > 81.50 | | | | |--- Age <= 46.00 | | | | | |--- Experience <= 3.50 | | | | | | |--- Income <= 91.50 | | | | | | | |--- weights: [0.00, 10.67] class: 1 | | | | | | |--- Income > 91.50 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | |--- Experience > 3.50 | | | | | | |--- CreditCard <= 0.50 | | | | | | | |--- weights: [8.83, 0.00] class: 0 | | | | | | |--- CreditCard > 0.50 | | | | | | | |--- CCAvg <= 3.70 | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | | | |--- CCAvg > 3.70 | | | | | | | | |--- Experience <= 17.50 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | |--- Experience > 17.50 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | |--- Age > 46.00 | | | | | |--- CCAvg <= 3.05 | | | | | | |--- Income <= 91.50 | | | | | | | |--- weights: [2.76, 0.00] class: 0 | | | | | | |--- Income > 91.50 | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | |--- CCAvg > 3.05 | | | | | | |--- Income <= 90.50 | | | | | | | |--- Mortgage <= 142.50 | | | | | | | | |--- Age <= 63.50 | | | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | | | |--- weights: [0.00, 42.67] class: 1 | | | | | | | | | |--- CCAvg > 3.75 | | | | | | | | | | |--- Mortgage <= 48.50 | | | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | | | |--- Mortgage > 48.50 | | | | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | | | | |--- Age > 63.50 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | |--- Mortgage > 142.50 | | | | | | | | |--- Age <= 64.50 | | | | | | | | | |--- weights: [1.66, 0.00] class: 0 | | | | | | | | |--- Age > 64.50 | | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | | |--- Income > 90.50 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | |--- CCAvg > 4.20 | | | |--- Age <= 34.00 | | | | |--- weights: [0.55, 0.00] class: 0 | | | |--- Age > 34.00 | | | | |--- weights: [15.45, 0.00] class: 0 |--- Income > 92.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- CD_Account <= 0.50 | | | | |--- Income <= 99.50 | | | | | |--- Income <= 98.50 | | | | | | |--- Income <= 93.50 | | | | | | | |--- weights: [7.72, 0.00] class: 0 | | | | | | |--- Income > 93.50 | | | | | | | |--- weights: [13.79, 0.00] class: 0 | | | | | |--- Income > 98.50 | | | | | | |--- CCAvg <= 4.20 | | | | | | | |--- Family <= 1.50 | | | | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | | | | |--- Family > 1.50 | | | | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | | | |--- CCAvg > 4.20 | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | |--- Income > 99.50 | | | | | |--- CreditCard <= 0.50 | | | | | | |--- weights: [208.55, 0.00] class: 0 | | | | | |--- CreditCard > 0.50 | | | | | | |--- weights: [75.03, 0.00] class: 0 | | | |--- 
CD_Account > 0.50 | | | | |--- Income <= 107.00 | | | | | |--- Age <= 52.00 | | | | | | |--- CCAvg <= 2.61 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | |--- CCAvg > 2.61 | | | | | | | |--- weights: [0.00, 16.00] class: 1 | | | | | |--- Age > 52.00 | | | | | | |--- CCAvg <= 4.60 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | |--- CCAvg > 4.60 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | |--- Income > 107.00 | | | | | |--- Experience <= 9.50 | | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | | |--- Experience > 9.50 | | | | | | |--- weights: [8.83, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Family <= 3.50 | | | | | |--- CCAvg <= 3.25 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [5.52, 0.00] class: 0 | | | | | |--- CCAvg > 3.25 | | | | | | |--- Age <= 46.50 | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | | |--- Age > 46.50 | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | |--- weights: [2.21, 0.00] class: 0 | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- Income <= 93.50 | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | |--- Income > 93.50 | | | | | | |--- CCAvg <= 4.50 | | | | | | | |--- weights: [0.00, 26.67] class: 1 | | | | | | |--- CCAvg > 4.50 | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | |--- Income > 113.50 | | | | |--- Income <= 116.50 | | | | | |--- weights: [0.00, 10.67] class: 1 | | | | |--- Income > 116.50 | | | | | |--- weights: [0.00, 304.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [40.83, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- Mortgage <= 354.50 | | | | | | | | | |--- CCAvg <= 2.83 | | | | | | | | | | |--- CCAvg <= 0.30 | | | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | | | |--- CCAvg > 0.30 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | |--- CCAvg > 2.83 | | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | |--- Mortgage > 354.50 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- weights: [4.41, 0.00] class: 0 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- Age <= 34.00 | | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | | |--- Age > 34.00 | | | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [6.62, 0.00] class: 0 | | | |--- CCAvg > 2.95 | | | | |--- Family <= 2.50 | | | | | |--- Education <= 2.50 | | | | | | |--- weights: [0.00, 32.00] class: 1 | | | | | |--- Education > 2.50 | | | | | | |--- Experience <= 25.50 | | | | | | | |--- Age <= 31.00 | | | | | | | | |--- weights: [0.00, 10.67] class: 1 | | | | | | | |--- Age > 31.00 | | | | | | | | |--- weights: [6.07, 0.00] class: 0 | | | | | | |--- Experience > 25.50 | | | | | | | |--- CCAvg <= 3.30 | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | |--- CCAvg > 3.30 | | | | | | | | |--- weights: [0.00, 21.33] class: 1 | | | | |--- Family > 
2.50 | | | | | |--- Experience <= 37.50 | | | | | | |--- Experience <= 35.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.00, 42.67] class: 1 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [0.00, 37.33] class: 1 | | | | | | |--- Experience > 35.50 | | | | | | | |--- Experience <= 36.50 | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | |--- Experience > 36.50 | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | |--- Experience > 37.50 | | | | | | |--- weights: [1.66, 0.00] class: 0 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- Mortgage <= 94.50 | | | | | |--- Age <= 57.50 | | | | | | |--- Family <= 1.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- Age <= 38.00 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | | | |--- Age > 38.00 | | | | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | | | |--- Family > 1.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- weights: [0.00, 32.00] class: 1 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | | |--- Age > 57.50 | | | | | | |--- weights: [0.55, 0.00] class: 0 | | | | |--- Mortgage > 94.50 | | | | | |--- weights: [1.10, 0.00] class: 0 | | | |--- Income > 116.50 | | | | |--- CCAvg <= 0.05 | | | | | |--- weights: [0.00, 5.33] class: 1 | | | | |--- CCAvg > 0.05 | | | | | |--- weights: [0.00, 1258.67] class: 1
# top 4 features for default model
importances_default.head(4)
|   | feature | importance |
|---|---|---|
| 5 | Education | 0.38 |
| 2 | Income | 0.30 |
| 3 | Family | 0.18 |
| 4 | CCAvg | 0.05 |
# top 4 features for balanced model
importances_balanced.head(4)
|   | feature | importance |
|---|---|---|
| 2 | Income | 0.61 |
| 3 | Family | 0.15 |
| 4 | CCAvg | 0.09 |
| 5 | Education | 0.09 |
NOTE:
It seems Education deserves more weight than Family size for the personal loan campaign: the first (default) model ranks it highest and converges earlier with a shallower tree, at almost equal performance.
NOTE: Once you get the best estimator (i.e., with refit=True), you can use it directly to make predictions.
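For intuition, with refit=True (GridSearchCV's default) the search object refits the best hyper-parameters on the full training set, so it can predict directly; a minimal sketch (the param_grid below is a stand-in for illustration, not the helper's actual grid):
# Minimal sketch of the refit behaviour (param_grid is illustrative only)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    param_grid={'max_depth': [3, 5, 10], 'min_samples_leaf': [1, 4]},
    scoring='f1', cv=5, refit=True,
)
grid.fit(X_train, y_train)
y_pred_sketch = grid.predict(X_test)  # delegates to grid.best_estimator_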
results = get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer='f1')
Building model ...
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
Model built using scoring strategy: f1
dt_preprunned_f1 = results['best_model']
# predict on train data
y_train_pred = dt_preprunned_f1.predict(X_train)
# predict on test data
y_test_pred = dt_preprunned_f1.predict(X_test)
Training Evaluation
# get performance metrics
dt_preprunned_f1_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_preprunned_f1_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.93 | 0.98 | 0.96 |
# get confusion matrix
plot_confusion_matrix(y_train, y_train_pred)
Testing Evaluation
# get performance metrics
dt_preprunned_f1_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_preprunned_f1_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.91 | 0.95 | 0.93 |
# get confusion matrix
plot_confusion_matrix(y_test, y_test_pred)
# evaluate the model
print(classification_report(y_test, y_test_pred))
precision recall f1-score support
0 0.99 0.99 0.99 895
1 0.95 0.91 0.93 105
accuracy 0.99 1000
macro avg 0.97 0.95 0.96 1000
weighted avg 0.99 0.99 0.99 1000
# Check tree complexity
get_tree_stats(dt_preprunned_f1)
Tree Statistics:
Number of nodes: 81
Tree depth: 10
X_train.columns
Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Get and display feature importances
importances_preprunned_f1 = get_feature_importances(dt_preprunned_f1, X_train)
print("\nFeature Importances:")
print(importances_preprunned_f1)
Feature Importances:
feature importance
5 Education 0.39
2 Income 0.31
3 Family 0.18
4 CCAvg 0.05
0 Age 0.02
8 CD_Account 0.02
9 Online 0.01
1 Experience 0.01
10 CreditCard 0.00
6 Mortgage 0.00
7 Securities_Account 0.00
The top 4 features match, in the same order, those of the default (i.e., baseline) model without any pruning.
plot_decision_tree(dt_preprunned_f1, X_train)
Much better in terms of readability compared to the earlier ones.
print_tree_rules(dt_preprunned_f1, X_train.columns)
|--- Income <= 113.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2892.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Education <= 1.50 | | | | |--- Family <= 2.50 | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | |--- Family > 2.50 | | | | | |--- CCAvg <= 1.05 | | | | | | |--- weights: [3.00, 2.00] class: 0 | | | | | |--- CCAvg > 1.05 | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | |--- Education > 1.50 | | | | |--- CCAvg <= 1.65 | | | | | |--- Online <= 0.50 | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- Online > 0.50 | | | | | | |--- Income <= 112.50 | | | | | | | |--- weights: [1.00, 4.00] class: 1 | | | | | | |--- Income > 112.50 | | | | | | | |--- weights: [2.00, 2.00] class: 0 | | | | |--- CCAvg > 1.65 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Age <= 35.00 | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | | |--- Age > 35.00 | | | | | | | |--- Experience <= 30.00 | | | | | | | | |--- weights: [4.00, 2.00] class: 0 | | | | | | | |--- Experience > 30.00 | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- weights: [2.00, 2.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 82.50 | | | | |--- Experience <= 8.50 | | | | | |--- Education <= 2.50 | | | | | | |--- weights: [1.00, 3.00] class: 1 | | | | | |--- Education > 2.50 | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- Income <= 81.50 | | | | | | |--- Mortgage <= 216.50 | | | | | | | |--- Experience <= 18.50 | | | | | | | | |--- Age <= 41.50 | | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | | |--- Age > 41.50 | | | | | | | | | |--- weights: [3.00, 1.00] class: 0 | | | | | | | |--- Experience > 18.50 | | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | |--- Mortgage > 216.50 | | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | | | |--- Income > 81.50 | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | |--- Income > 82.50 | | | | |--- Family <= 2.50 | | | | | |--- Experience <= 33.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [2.00, 3.00] class: 1 | | | | | | |--- Experience > 3.50 | | | | | | | |--- Income <= 83.50 | | | | | | | | |--- weights: [4.00, 2.00] class: 0 | | | | | | | |--- Income > 83.50 | | | | | | | | |--- Education <= 1.50 | | | | | | | | | |--- weights: [42.00, 0.00] class: 0 | | | | | | | | |--- Education > 1.50 | | | | | | | | | |--- Income <= 104.00 | | | | | | | | | | |--- weights: [24.00, 2.00] class: 0 | | | | | | | | | |--- Income > 104.00 | | | | | | | | | | |--- weights: [2.00, 2.00] class: 0 | | | | | |--- Experience > 33.50 | | | | | | |--- Education <= 1.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [2.00, 2.00] class: 0 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | |--- Education > 1.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- Family > 2.50 | | | | | |--- Age <= 57.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- Income <= 89.00 | | | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | | | | | |--- Income > 89.00 | | | | | | | | |--- weights: [1.00, 5.00] class: 1 | | | | | |--- Age > 57.00 | | | | | | |--- Mortgage <= 57.50 | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | |--- Mortgage > 57.50 | | | | | | | |--- weights: [2.00, 2.00] class: 0 | | |--- CD_Account > 0.50 | 
| | |--- CCAvg <= 3.85 | | | | |--- weights: [0.00, 8.00] class: 1 | | | |--- CCAvg > 3.85 | | | | |--- CreditCard <= 0.50 | | | | | |--- weights: [1.00, 4.00] class: 1 | | | | |--- CreditCard > 0.50 | | | | | |--- weights: [5.00, 2.00] class: 0 |--- Income > 113.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [463.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 59.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- Age <= 57.50 | | | | |--- CCAvg <= 2.80 | | | | | |--- Age <= 43.50 | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | | | |--- Age > 43.50 | | | | | | |--- weights: [1.00, 4.00] class: 1 | | | | |--- CCAvg > 2.80 | | | | | |--- weights: [0.00, 6.00] class: 1 | | | |--- Age > 57.50 | | | | |--- weights: [6.00, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 237.00] class: 1
plot_preprunning_results(results)
Key observations 🔍:
The pre-pruned model doesn't seem to improve on the default model; the performance metrics are almost identical.
NOTE: The tree depth has dropped markedly compared to the baseline (10 instead of 13), and the node count has fallen likewise.
# balance of target class
df_original['Personal_Loan'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
0    90.4%
1     9.6%
Name: Personal_Loan, dtype: object
Q. Why focus on Recall?
Motivation | Recall (Sensitivity):
Our target variable is highly imbalanced, with only 10% of customers opting in and 90% not opting in. In such cases, misclassifying the minority class (opt-ins) as non-opt-ins can be costly, leading to missed business opportunities.
By prioritizing Recall, we ensure that our model correctly identifies as many actual opt-ins as possible, minimizing False Negatives. While Precision may drop slightly, that is an acceptable trade-off, since mistakenly classifying a few non-opt-ins as opt-ins (False Positives) is manageable. This approach aligns with our goal of capturing potential customers effectively.
Business Perspective
- Since only about 10% of customers opted for the loan, missing even a small portion of them would mean a significant loss in potential business.
So, Recall is indeed a suitable metric to optimize for.
NOTE: F1-score is also a strong candidate; we should rule options in or out empirically (i.e., by trial and error) rather than relying too heavily on hypotheses.
Hence, let's give recall a shot (focus on finding actual loan takers)!
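Before retuning for recall, it helps to quantify what the current model misses; a quick illustrative count of false negatives, reusing the imported confusion_matrix and the latest test predictions:
# Illustrative: count missed opt-ins (false negatives) on the test set
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
print(f"Missed opt-ins (FN): {fn} of {fn + tp} actual opt-ins")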
# Focus on finding actual loan takers
results = get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer='recall')
Building model ...
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
Model built using scoring strategy: recall
dt_preprunned_recall = results['best_model']
# predict on train data
y_train_pred = dt_preprunned_recall.predict(X_train)
# predict on test data
y_test_pred = dt_preprunned_recall.predict(X_test)
Training Evaluation
# get performance metrics
dt_preprunned_recall_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_preprunned_recall_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.89 | 0.97 | 0.93 |
# get confusion matrix
plot_confusion_matrix(y_train, y_train_pred)
Testing Evaluation
# get performance metrics
dt_preprunned_recall_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_preprunned_recall_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 0.89 | 0.95 | 0.92 |
Interesting Observation 👀
Compared to the earlier pre-pruned model (scoring='f1'), this recall-specific model (scoring='recall') actually has a lower recall score.
This situation can occur for several reasons (for example, noisy cross-validation estimates when positive samples are scarce).
Overall, it appears that because very few people accept the loan and the dataset is not large, recall-oriented tuning may not capture enough signal to generalize well.
# get confusion matrix
plot_confusion_matrix(y_test, y_test_pred)
# evaluate the model
print(classification_report(y_test, y_test_pred))
precision recall f1-score support
0 0.99 0.99 0.99 895
1 0.95 0.89 0.92 105
accuracy 0.98 1000
macro avg 0.97 0.94 0.95 1000
weighted avg 0.98 0.98 0.98 1000
# Check tree complexity
get_tree_stats(dt_preprunned_recall)
Tree Statistics:
Number of nodes: 59
Tree depth: 10
X_train.columns
Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard'],
dtype='object')
# Get and display feature importances
importances_preprunned_recall = get_feature_importances(dt_preprunned_recall, X_train)
print("\nFeature Importances:")
print(importances_preprunned_recall)
Feature Importances:
feature importance
5 Education 0.40
2 Income 0.32
3 Family 0.19
4 CCAvg 0.05
8 CD_Account 0.02
0 Age 0.01
1 Experience 0.01
9 Online 0.01
6 Mortgage 0.00
7 Securities_Account 0.00
10 CreditCard 0.00
plot_decision_tree(dt_preprunned_recall, X_train)
print_tree_rules(dt_preprunned_recall, X_train.columns)
|--- Income <= 113.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2892.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Education <= 1.50 | | | | |--- Family <= 3.50 | | | | | |--- weights: [36.00, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- Education > 1.50 | | | | |--- CCAvg <= 1.65 | | | | | |--- weights: [9.00, 6.00] class: 0 | | | | |--- CCAvg > 1.65 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Income <= 108.50 | | | | | | | |--- weights: [2.00, 1.00] class: 0 | | | | | | |--- Income > 108.50 | | | | | | | |--- CCAvg <= 1.75 | | | | | | | | |--- weights: [4.00, 1.00] class: 0 | | | | | | | |--- CCAvg > 1.75 | | | | | | | | |--- weights: [17.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- weights: [2.00, 2.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 82.50 | | | | |--- Experience <= 8.50 | | | | | |--- weights: [7.00, 3.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- Income <= 81.50 | | | | | | |--- Mortgage <= 216.50 | | | | | | | |--- Experience <= 18.50 | | | | | | | | |--- Age <= 42.50 | | | | | | | | | |--- weights: [22.00, 0.00] class: 0 | | | | | | | | |--- Age > 42.50 | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | |--- Experience > 18.50 | | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | |--- Mortgage > 216.50 | | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | | | |--- Income > 81.50 | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | |--- Income > 82.50 | | | | |--- Family <= 2.50 | | | | | |--- Experience <= 33.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [2.00, 3.00] class: 1 | | | | | | |--- Experience > 3.50 | | | | | | | |--- Income <= 83.50 | | | | | | | | |--- weights: [4.00, 2.00] class: 0 | | | | | | | |--- Income > 83.50 | | | | | | | | |--- Education <= 1.50 | | | | | | | | | |--- weights: [42.00, 0.00] class: 0 | | | | | | | | |--- Education > 1.50 | | | | | | | | | |--- Income <= 109.50 | | | | | | | | | | |--- weights: [26.00, 2.00] class: 0 | | | | | | | | | |--- Income > 109.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Experience > 33.50 | | | | | | |--- weights: [6.00, 6.00] class: 0 | | | | |--- Family > 2.50 | | | | | |--- Age <= 57.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [7.00, 6.00] class: 0 | | | | | |--- Age > 57.00 | | | | | | |--- weights: [9.00, 2.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 3.85 | | | | |--- weights: [0.00, 8.00] class: 1 | | | |--- CCAvg > 3.85 | | | | |--- weights: [6.00, 6.00] class: 0 |--- Income > 113.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [463.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 59.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- Age <= 57.50 | | | | |--- weights: [7.00, 11.00] class: 1 | | | |--- Age > 57.50 | | | | |--- weights: [6.00, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 237.00] class: 1
plot_preprunning_results(results)
Observation 🔍
Though the two models above (F1- and recall-driven) differ slightly in metrics, both focus on the same features in the same order,
i.e.
Education > Income > Family > CCAvg
Thus the overall approach seems to be heading in the right direction.
Overall Observation 🔍:
Since the target is heavily imbalanced, let's try incorporating weighted averaging (i.e., weighting by the number of samples in each class).
Why?
Since we have 90% no-loan customers and only 10% loan customers, a naive model could predict "no loan" most of the time and still get high accuracy. A weighted F1-score ensures the small class (loan customers) is not ignored while maintaining balance.
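For reference, 'f1_weighted' is simply the support-weighted mean of the per-class F1 scores; a quick sketch using the metrics already imported and the most recent test predictions:
# Sketch: manually reconstruct the 'f1_weighted' aggregation
per_class_f1 = f1_score(y_test, y_test_pred, average=None)  # one F1 per class
support = np.bincount(y_test)                               # samples per class
print(np.average(per_class_f1, weights=support))            # matches average='weighted'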
NOTE: We can specify the scoring strategy in two ways, as shown below (a custom scorer object, or a built-in scoring string):
# Create a custom scorer for weighted F1
#f1_weighted_scorer = make_scorer(f1_score, average='weighted')
# results = get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer=f1_weighted_scorer)
results = get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer='f1_weighted', should_balance_target=True)
Building model ...
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
Model built using scoring strategy: f1_weighted
dt_preprunned_f1_weighted = results['best_model']
# predict on train data
y_train_pred = dt_preprunned_f1_weighted.predict(X_train)
# predict on test data
y_test_pred = dt_preprunned_f1_weighted.predict(X_test)
Training Evaluation
# get performance metrics
dt_preprunned_f1_weighted_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_preprunned_f1_weighted_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 1.00 | 0.93 | 0.96 |
# get confusion matrix
plot_confusion_matrix(y_train, y_train_pred)
Testing Evaluation
# get performance metrics
dt_preprunned_f1_weighted_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_preprunned_f1_weighted_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 0.92 | 0.91 | 0.92 |
Better than the earlier pre-pruned models.
# get confusion matrix
plot_confusion_matrix(y_test, y_test_pred)
# evaluate the model
print(classification_report(y_test, y_test_pred))
precision recall f1-score support
0 0.99 0.99 0.99 895
1 0.91 0.92 0.92 105
accuracy 0.98 1000
macro avg 0.95 0.96 0.95 1000
weighted avg 0.98 0.98 0.98 1000
# Check tree complexity
get_tree_stats(dt_preprunned_f1_weighted)
Tree Statistics:
Number of nodes: 99
Tree depth: 10
# Get and display feature importances
importances_preprunned_f1_weighted = get_feature_importances(dt_preprunned_f1_weighted, X_train)
print("\nFeature Importances:")
print(importances_preprunned_f1_weighted)
Feature Importances:
feature importance
2 Income 0.62
3 Family 0.15
5 Education 0.09
4 CCAvg 0.09
1 Experience 0.02
0 Age 0.01
8 CD_Account 0.01
10 CreditCard 0.00
6 Mortgage 0.00
7 Securities_Account 0.00
9 Online 0.00
plot_decision_tree(dt_preprunned_f1_weighted, X_train)
print_tree_rules(dt_preprunned_f1_weighted, X_train.columns)
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [1522.76, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CCAvg <= 4.20 | | | |--- Income <= 81.50 | | | | |--- Experience <= 8.50 | | | | | |--- Family <= 3.50 | | | | | | |--- weights: [0.00, 26.67] class: 1 | | | | | |--- Family > 3.50 | | | | | | |--- weights: [3.86, 0.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- Experience <= 18.50 | | | | | | |--- Age <= 42.50 | | | | | | | |--- weights: [12.14, 0.00] class: 0 | | | | | | |--- Age > 42.50 | | | | | | | |--- weights: [0.55, 10.67] class: 1 | | | | | |--- Experience > 18.50 | | | | | | |--- weights: [19.86, 0.00] class: 0 | | | |--- Income > 81.50 | | | | |--- Age <= 46.00 | | | | | |--- Experience <= 3.50 | | | | | | |--- weights: [0.55, 10.67] class: 1 | | | | | |--- Experience > 3.50 | | | | | | |--- CreditCard <= 0.50 | | | | | | | |--- weights: [8.83, 0.00] class: 0 | | | | | | |--- CreditCard > 0.50 | | | | | | | |--- weights: [1.10, 5.33] class: 1 | | | | |--- Age > 46.00 | | | | | |--- CCAvg <= 3.05 | | | | | | |--- Experience <= 26.50 | | | | | | | |--- weights: [2.21, 0.00] class: 0 | | | | | | |--- Experience > 26.50 | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | | |--- CCAvg > 3.05 | | | | | | |--- Mortgage <= 142.50 | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | |--- weights: [0.00, 42.67] class: 1 | | | | | | | |--- CCAvg > 3.75 | | | | | | | | |--- weights: [1.66, 5.33] class: 1 | | | | | | |--- Mortgage > 142.50 | | | | | | | |--- weights: [1.66, 5.33] class: 1 | | |--- CCAvg > 4.20 | | | |--- weights: [16.00, 0.00] class: 0 |--- Income > 92.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- CD_Account <= 0.50 | | | | |--- Income <= 99.50 | | | | | |--- Income <= 98.50 | | | | | | |--- weights: [21.52, 0.00] class: 0 | | | | | |--- Income > 98.50 | | | | | | |--- Age <= 58.50 | | | | | | | |--- weights: [1.66, 0.00] class: 0 | | | | | | |--- Age > 58.50 | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | |--- Income > 99.50 | | | | | |--- weights: [283.59, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- Income <= 107.00 | | | | | |--- Age <= 52.00 | | | | | | |--- weights: [0.55, 16.00] class: 1 | | | | | |--- Age > 52.00 | | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | |--- Income > 107.00 | | | | | |--- weights: [9.93, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Family <= 3.50 | | | | | |--- CCAvg <= 3.25 | | | | | | |--- weights: [10.48, 0.00] class: 0 | | | | | |--- CCAvg > 3.25 | | | | | | |--- Experience <= 32.50 | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | | | |--- Experience > 32.50 | | | | | | | |--- weights: [2.21, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- Income <= 102.00 | | | | | | |--- weights: [1.10, 5.33] class: 1 | | | | | |--- Income > 102.00 | | | | | | |--- weights: [0.00, 21.33] class: 1 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 314.67] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [40.83, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- Education <= 2.50 | | | | | | | | | |--- CCAvg <= 1.95 | | | | | | | | | | |--- weights: [2.21, 16.00] class: 1 | | | | | | | | | |--- CCAvg > 1.95 | | | | | | | | | | |--- 
weights: [1.66, 0.00] class: 0 | | | | | | | | |--- Education > 2.50 | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | |--- weights: [1.66, 10.67] class: 1 | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | |--- weights: [0.55, 32.00] class: 1 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- weights: [4.41, 0.00] class: 0 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [6.62, 0.00] class: 0 | | | |--- CCAvg > 2.95 | | | | |--- Family <= 2.50 | | | | | |--- Education <= 2.50 | | | | | | |--- weights: [0.00, 32.00] class: 1 | | | | | |--- Education > 2.50 | | | | | | |--- Experience <= 25.50 | | | | | | | |--- Age <= 31.00 | | | | | | | | |--- weights: [0.00, 10.67] class: 1 | | | | | | | |--- Age > 31.00 | | | | | | | | |--- weights: [6.07, 0.00] class: 0 | | | | | | |--- Experience > 25.50 | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | | | | |--- CCAvg > 3.75 | | | | | | | | |--- weights: [0.00, 16.00] class: 1 | | | | |--- Family > 2.50 | | | | | |--- Experience <= 37.50 | | | | | | |--- Experience <= 35.50 | | | | | | | |--- weights: [0.00, 80.00] class: 1 | | | | | | |--- Experience > 35.50 | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | | |--- Experience > 37.50 | | | | | | |--- weights: [1.66, 0.00] class: 0 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- Mortgage <= 94.50 | | | | | |--- Family <= 1.50 | | | | | | |--- weights: [1.10, 5.33] class: 1 | | | | | |--- Family > 1.50 | | | | | | |--- Experience <= 25.00 | | | | | | | |--- weights: [0.00, 32.00] class: 1 | | | | | | |--- Experience > 25.00 | | | | | | | |--- weights: [0.55, 5.33] class: 1 | | | | |--- Mortgage > 94.50 | | | | | |--- weights: [1.10, 0.00] class: 0 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 1264.00] class: 1
plot_preprunning_results(results)
NOTE 📌
This is by far the most successful pre-pruned model we have seen so far.
This model prioritizes Income & Family over Education when considering loans, which seems reasonable from a layman's perspective.
Still, the top 4 features remain the same, i.e.
Income > Family > Education > CCAvg
Let's try a few more combinations (just out of curiosity):
using f1_weighted first, then a balanced tree (i.e., one step at a time).
# Find best parameters using f1_weighted scoring
results = get_preprunned_dt_classifier(X_train, X_test, y_train, y_test, scorer='f1_weighted')
Building model ...
Fitting 5 folds for each of 320 candidates, totalling 1600 fits
Model built using scoring strategy: f1_weighted
# Now take the best params learned from the grid search above and refit with class_weight='balanced'
# (unlike the earlier approach, where balancing was applied during the grid search itself)
# Here the two steps are applied separately (i.e., one at a time)
best_params = results['best_params']
dt_preprunned_f1_wei_bal = DecisionTreeClassifier(random_state=SEED, **best_params, class_weight='balanced')
dt_preprunned_f1_wei_bal.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=10, max_leaf_nodes=50,
                       min_samples_leaf=4, min_samples_split=5,
                       random_state=42)
y_train_pred = dt_preprunned_f1_wei_bal.predict(X_train)
y_test_pred = dt_preprunned_f1_wei_bal.predict(X_test)
dt_preprunned_f1_wei_bal_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_preprunned_f1_wei_bal_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 1.00 | 0.86 | 0.93 |
dt_preprunned_f1_wei_bal_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_preprunned_f1_wei_bal_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.97 | 0.95 | 0.83 | 0.89 |
NOTE: although we get the best recall so far, precision suffers, and hence so does the F1-score.
This supports our earlier expectation that leaning too hard on recall can drag precision down, likely because so few customers in the dataset accept the loan.
get_tree_stats(dt_preprunned_f1_wei_bal)
Tree Statistics:
Number of nodes: 97
Tree depth: 10
plot_decision_tree(dt_preprunned_f1_wei_bal, X_train)
print_tree_rules(dt_preprunned_f1_wei_bal, X_train.columns)
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [1522.76, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CCAvg <= 4.20 | | | |--- Income <= 81.50 | | | | |--- Experience <= 8.50 | | | | | |--- Family <= 3.50 | | | | | | |--- weights: [0.00, 26.67] class: 1 | | | | | |--- Family > 3.50 | | | | | | |--- weights: [3.86, 0.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- Experience <= 18.50 | | | | | | |--- Age <= 41.50 | | | | | | | |--- weights: [11.03, 0.00] class: 0 | | | | | | |--- Age > 41.50 | | | | | | | |--- weights: [1.66, 10.67] class: 1 | | | | | |--- Experience > 18.50 | | | | | | |--- weights: [19.86, 0.00] class: 0 | | | |--- Income > 81.50 | | | | |--- Age <= 46.00 | | | | | |--- Experience <= 4.50 | | | | | | |--- weights: [1.10, 10.67] class: 1 | | | | | |--- Experience > 4.50 | | | | | | |--- CCAvg <= 3.65 | | | | | | | |--- weights: [2.21, 5.33] class: 1 | | | | | | |--- CCAvg > 3.65 | | | | | | | |--- weights: [7.17, 0.00] class: 0 | | | | |--- Age > 46.00 | | | | | |--- CCAvg <= 3.05 | | | | | | |--- weights: [2.76, 5.33] class: 1 | | | | | |--- CCAvg > 3.05 | | | | | | |--- Mortgage <= 142.50 | | | | | | | |--- CCAvg <= 3.75 | | | | | | | | |--- Age <= 53.00 | | | | | | | | | |--- weights: [0.00, 21.33] class: 1 | | | | | | | | |--- Age > 53.00 | | | | | | | | | |--- weights: [0.00, 21.33] class: 1 | | | | | | | |--- CCAvg > 3.75 | | | | | | | | |--- weights: [1.66, 5.33] class: 1 | | | | | | |--- Mortgage > 142.50 | | | | | | | |--- weights: [1.66, 5.33] class: 1 | | |--- CCAvg > 4.20 | | | |--- Securities_Account <= 0.50 | | | | |--- weights: [13.79, 0.00] class: 0 | | | |--- Securities_Account > 0.50 | | | | |--- weights: [2.21, 0.00] class: 0 |--- Income > 92.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- CD_Account <= 0.50 | | | | |--- Income <= 99.50 | | | | | |--- Income <= 98.50 | | | | | | |--- CCAvg <= 0.25 | | | | | | | |--- weights: [2.76, 0.00] class: 0 | | | | | | |--- CCAvg > 0.25 | | | | | | | |--- weights: [18.76, 0.00] class: 0 | | | | | |--- Income > 98.50 | | | | | | |--- weights: [2.21, 5.33] class: 1 | | | | |--- Income > 99.50 | | | | | |--- weights: [283.59, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- Income <= 107.00 | | | | | |--- weights: [1.66, 16.00] class: 1 | | | | |--- Income > 107.00 | | | | | |--- Income <= 120.00 | | | | | | |--- weights: [2.76, 0.00] class: 0 | | | | | |--- Income > 120.00 | | | | | | |--- weights: [7.17, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Family <= 3.50 | | | | | |--- CCAvg <= 3.25 | | | | | | |--- CCAvg <= 0.90 | | | | | | | |--- weights: [2.21, 0.00] class: 0 | | | | | | |--- CCAvg > 0.90 | | | | | | | |--- weights: [8.28, 0.00] class: 0 | | | | | |--- CCAvg > 3.25 | | | | | | |--- weights: [2.76, 5.33] class: 1 | | | | |--- Family > 3.50 | | | | | |--- weights: [1.10, 26.67] class: 1 | | | |--- Income > 113.50 | | | | |--- Securities_Account <= 0.50 | | | | | |--- weights: [0.00, 272.00] class: 1 | | | | |--- Securities_Account > 0.50 | | | | | |--- weights: [0.00, 42.67] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [40.83, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | |--- Education <= 2.50 | | | | | | | | | |--- CCAvg <= 1.70 | 
| | | | | | | | | |--- weights: [1.10, 10.67] class: 1 | | | | | | | | | |--- CCAvg > 1.70 | | | | | | | | | | |--- weights: [2.76, 5.33] class: 1 | | | | | | | | |--- Education > 2.50 | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | |--- weights: [1.66, 10.67] class: 1 | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | |--- weights: [0.55, 32.00] class: 1 | | | | | | | |--- CreditCard > 0.50 | | | | | | | | |--- Family <= 2.50 | | | | | | | | | |--- weights: [3.31, 0.00] class: 0 | | | | | | | | |--- Family > 2.50 | | | | | | | | | |--- weights: [1.66, 5.33] class: 1 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [6.62, 0.00] class: 0 | | | |--- CCAvg > 2.95 | | | | |--- Family <= 2.50 | | | | | |--- Education <= 2.50 | | | | | | |--- weights: [0.00, 32.00] class: 1 | | | | | |--- Education > 2.50 | | | | | | |--- Experience <= 25.50 | | | | | | | |--- CCAvg <= 3.95 | | | | | | | | |--- weights: [1.10, 10.67] class: 1 | | | | | | | |--- CCAvg > 3.95 | | | | | | | | |--- weights: [4.97, 0.00] class: 0 | | | | | | |--- Experience > 25.50 | | | | | | | |--- weights: [0.55, 21.33] class: 1 | | | | |--- Family > 2.50 | | | | | |--- Age <= 60.00 | | | | | | |--- weights: [0.00, 80.00] class: 1 | | | | | |--- Age > 60.00 | | | | | | |--- weights: [2.21, 5.33] class: 1 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- Family <= 1.50 | | | | | |--- weights: [1.66, 5.33] class: 1 | | | | |--- Family > 1.50 | | | | | |--- Online <= 0.50 | | | | | | |--- weights: [0.00, 21.33] class: 1 | | | | | |--- Online > 0.50 | | | | | | |--- weights: [1.10, 16.00] class: 1 | | | |--- Income > 116.50 | | | | |--- Family <= 1.50 | | | | | |--- weights: [0.00, 346.67] class: 1 | | | | |--- Family > 1.50 | | | | | |--- weights: [0.00, 917.33] class: 1
plot_preprunning_results(results)
This also seems good.
Though it has a good recall score, I would still opt for the previous model among the pre-pruned models we have encountered so far.
Basic Idea 💡
Minimal Cost Complexity Pruning
Gist: it helps prevent overfitting by removing branches that don't significantly improve prediction accuracy.
Basic Concept:
Post-pruning examines each node from bottom to top and evaluates whether removing the subtree rooted at that node (turning it into a leaf node) would improve the tree's performance on validation data.
Process:
The core process remains identical; only the evaluation criterion may change.
(Adding this brief primer because the intuition behind post-pruning is easy to forget and I otherwise have to google it each time.)
Cost Complexity Pruning & ccp_alpha
Core Idea: think of ccp_alpha as a "penalty for complexity".
The goal: ccp_alpha decides how much detail to sacrifice for simplicity.
Low alpha = complex tree (many nodes); high alpha = simple tree (fewer nodes).
Formally, the cost-complexity of a tree T is R_alpha(T) = R(T) + alpha * |leaves(T)|: its misclassification cost plus alpha times its number of leaf nodes. A subtree is kept only if the benefit it brings outweighs this complexity cost.
Higher Penalty for Misclassification
The trade-off: when you prune (remove a node), you exchange a simpler tree for potentially higher misclassification, since you might remove useful branches; ccp_alpha controls how strongly complexity is penalized in that trade.
Real-World Analogy 🌍
Think of studying for an exam:
The optimal study plan keeps what’s essential and removes unnecessary details—just like the best ccp_alpha!
Procedure 🎯
Remember ⚡
So, pruning starts from the least penalized subtrees (lower ccp_alpha values) and moves towards higher values; higher ccp_alpha values generally result in a more compact tree with fewer nodes.
Think of alpha as the amount of risk you take.
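Concretely, scikit-learn exposes this pruning sequence via cost_complexity_pruning_path; a minimal sketch of the call the helper below builds on:
# Minimal sketch: inspect the effective alphas for the training data
dt_full = DecisionTreeClassifier(random_state=SEED)
dt_full.fit(X_train, y_train)
path = dt_full.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas[:5])   # increasing alpha values
print(path.impurities[:5])   # total leaf impurity at each alpha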
def plot_ccp_pruning(alphas, train_scores, test_scores, node_counts, depths):
"""Helper to plot CCP results in separate rows"""
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 15))
# Plot 1: F1 Score vs Alpha
ax1.plot(alphas, train_scores, label='train', marker='o')
ax1.plot(alphas, test_scores, label='test', marker='o')
ax1.set_xlabel('alpha')
ax1.set_ylabel('score')
ax1.legend()
ax1.set_title('Score vs Alpha')
ax1.grid(True)
# Plot 2: Nodes vs Alpha
ax2.plot(alphas, node_counts, color='red', marker='o')
ax2.set_xlabel('alpha')
ax2.set_ylabel('number of nodes')
ax2.set_title('Nodes vs Alpha')
ax2.grid(True)
# Plot 3: Depth vs Alpha
ax3.plot(alphas, depths, color='green', marker='o')
ax3.set_xlabel('alpha')
ax3.set_ylabel('tree depth')
ax3.set_title('Depth vs Alpha')
ax3.grid(True)
plt.tight_layout()
plt.show()
def perform_ccp_pruning(
X_train,
X_test,
y_train,
y_test,
scoring_func=f1_score,
random_state=SEED,
should_balance_target=False,
):
"""Perform Cost Complexity Pruning
This function performs cost complexity pruning on a decision tree classifier.
It evaluates the performance of the tree at different levels of complexity (controlled by ccp_alpha).
By default it uses F1-score as scoring function.
"""
target_weight = "balanced" if should_balance_target else None
# Step 1: Create and fit full tree first
dt = DecisionTreeClassifier(
random_state=random_state,
class_weight=target_weight,
)
dt.fit(X_train, y_train)
# Step 2: Get the sequence of alphas from fitted tree
path = dt.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas[:-1]  # Drop the last alpha (it would prune the tree down to just the root)
# Step 3: Train models with different alphas
train_scores = []
test_scores = []
node_counts = []
depths = []
for alpha in alphas:
dt = DecisionTreeClassifier(
ccp_alpha=alpha,
random_state=random_state,
class_weight=target_weight,
)
dt.fit(X_train, y_train)
# Get predictions
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)
# Calculate scores using provided scoring function
train_score = scoring_func(y_train, y_train_pred)
test_score = scoring_func(y_test, y_test_pred)
# Store results
train_scores.append(train_score)
test_scores.append(test_score)
node_counts.append(dt.tree_.node_count)
depths.append(dt.get_depth())
# Step 4: Find best alpha (note: selected by test-set score here, so treat results as exploratory)
best_alpha_idx = np.argmax(test_scores)
best_alpha = alphas[best_alpha_idx]
# Step 5: Train final model with best alpha
final_tree = DecisionTreeClassifier(
ccp_alpha=best_alpha,
random_state=random_state,
class_weight=target_weight,
)
final_tree.fit(X_train, y_train)
# Step 6: Print results
print(f"Best alpha: {best_alpha}")
print(f"Number of nodes in pruned tree: {final_tree.tree_.node_count}")
print(f"Tree depth: {final_tree.get_depth()}")
print("\nClassification Report:")
print(classification_report(y_test, final_tree.predict(X_test)))
# Step 7: Plot results
plot_ccp_pruning(alphas, train_scores, test_scores, node_counts, depths)
return final_tree
dt_postprunned_f1 = perform_ccp_pruning(X_train, X_test, y_train, y_test, scoring_func=f1_score)
Best alpha: 0.0006000000000000003
Number of nodes in pruned tree: 41
Tree depth: 8
Classification Report:
precision recall f1-score support
0 0.99 1.00 0.99 895
1 0.97 0.93 0.95 105
accuracy 0.99 1000
macro avg 0.98 0.96 0.97 1000
weighted avg 0.99 0.99 0.99 1000
y_train_pred = dt_postprunned_f1.predict(X_train)
y_test_pred = dt_postprunned_f1.predict(X_test)
dt_postprunned_f1_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_postprunned_f1_train_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.93 | 0.97 | 0.95 |
dt_postprunned_f1_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_postprunned_f1_test_metrics
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.93 | 0.97 | 0.95 |
NOTE 💡: This is the best model among all seen so far.
plot_confusion_matrix(y_test, y_test_pred)
get_tree_stats(dt_postprunned_f1)
Tree Statistics:
Number of nodes: 41
Tree depth: 8
NOTE: The tree depth is also small, and performance is the best so far.
# Get and display feature importances
importances_postprunned_f1 = get_feature_importances(dt_postprunned_f1, X_train)
print("\nFeature Importances:")
print(importances_postprunned_f1)
Feature Importances:
feature importance
5 Education 0.40
2 Income 0.32
3 Family 0.19
4 CCAvg 0.04
1 Experience 0.02
8 CD_Account 0.02
0 Age 0.01
9 Online 0.01
6 Mortgage 0.00
7 Securities_Account 0.00
10 CreditCard 0.00
plot_decision_tree(dt_postprunned_f1, X_train)
print_tree_rules(dt_postprunned_f1, X_train.columns)
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2892.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Family > 3.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [34.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- weights: [85.00, 6.00] class: 0
|   |   |   |--- Income > 82.50
|   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |--- Experience <= 33.50
|   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |--- weights: [2.00, 3.00] class: 1
|   |   |   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |   |   |--- weights: [72.00, 6.00] class: 0
|   |   |   |   |   |--- Experience > 33.50
|   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |--- weights: [6.00, 2.00] class: 0
|   |   |   |   |   |   |--- Education > 1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Family > 2.50
|   |   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- Income <= 89.00
|   |   |   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- Income > 89.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |   |--- Age > 57.00
|   |   |   |   |   |   |--- weights: [9.00, 2.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [6.00, 14.00] class: 1
|--- Income > 113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [463.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 59.00] class: 1
|   |--- Education > 1.50
|   |   |--- Income <= 116.50
|   |   |   |--- Experience <= 32.00
|   |   |   |   |--- CCAvg <= 2.80
|   |   |   |   |   |--- Experience <= 18.00
|   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |--- Experience > 18.00
|   |   |   |   |   |   |--- weights: [1.00, 4.00] class: 1
|   |   |   |   |--- CCAvg > 2.80
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Experience > 32.00
|   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 237.00] class: 1
# plot to display feature importance
plot_feature_importance(importances_postprunned_f1)
def compare_model_performance(model_metrics, model_names=None):
"""
Compare performance metrics of different decision tree models
Parameters:
-----------
model_metrics : list of DataFrames
Performance metric DataFrames for each model
model_names : list of str, optional
Names to use for each model. If not provided, will use generic names
Returns:
--------
DataFrame
Combined performance metrics with models as columns
"""
# Concatenate all model metrics
comparison_df = pd.concat([m.T for m in model_metrics], axis=1)
# Set column names
if model_names:
comparison_df.columns = model_names
else:
comparison_df.columns = [f"Model {i+1}" for i in range(len(model_metrics))]
return comparison_df
model_names = [
"Decision Tree (default)",
"Decision Tree (Balanced)",
"Decision Tree (Pre-Pruning F1)",
"Decision Tree (Pre-Pruning Recall)",
"Decision Tree (Pre-Pruning F1 Weighted & Balanced)",
"Decision Tree (Pre-Pruning F1 Weighted then Balanced)",
"Decision Tree (Post-Pruning F1)",
]
Training Performance Comparison
model_metrics = [
dt_default_train_metrics,
dt_balanced_train_metrics,
dt_preprunned_f1_train_metrics,
dt_preprunned_recall_train_metrics,
dt_preprunned_f1_weighted_train_metrics,
dt_preprunned_f1_wei_bal_train_metrics,
dt_postprunned_f1_train_metrics,
]
models_train_comp_df = compare_model_performance(
model_metrics,
model_names=model_names
)
print("Training performance comparison:")
models_train_comp_df.T
Training performance comparison:
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Decision Tree (default) | 1.00 | 1.00 | 1.00 | 1.00 |
| Decision Tree (Balanced) | 1.00 | 1.00 | 1.00 | 1.00 |
| Decision Tree (Pre-Pruning F1) | 0.99 | 0.93 | 0.98 | 0.96 |
| Decision Tree (Pre-Pruning Recall) | 0.99 | 0.89 | 0.97 | 0.93 |
| Decision Tree (Pre-Pruning F1 Weighted & Balanced) | 0.99 | 1.00 | 0.93 | 0.96 |
| Decision Tree (Pre-Pruning F1 Weighted then Balanced) | 0.98 | 1.00 | 0.86 | 0.93 |
| Decision Tree (Post-Pruning F1) | 0.99 | 0.93 | 0.97 | 0.95 |
model_metrics = [
dt_default_test_metrics,
dt_balanced_test_metrics,
dt_preprunned_f1_test_metrics,
dt_preprunned_recall_test_metrics,
dt_preprunned_f1_weighted_test_metrics,
dt_preprunned_f1_wei_bal_test_metrics,
dt_postprunned_f1_test_metrics,
]
models_test_comp_df = compare_model_performance(
model_metrics,
model_names=model_names
)
print("Test performance comparison:")
models_test_comp_df.T
Test performance comparison:
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Decision Tree (default) | 0.99 | 0.93 | 0.95 | 0.94 |
| Decision Tree (Balanced) | 0.99 | 0.92 | 0.96 | 0.94 |
| Decision Tree (Pre-Pruning F1) | 0.99 | 0.91 | 0.95 | 0.93 |
| Decision Tree (Pre-Pruning Recall) | 0.98 | 0.89 | 0.95 | 0.92 |
| Decision Tree (Pre-Pruning F1 Weighted & Balanced) | 0.98 | 0.92 | 0.91 | 0.92 |
| Decision Tree (Pre-Pruning F1 Weighted then Balanced) | 0.97 | 0.95 | 0.83 | 0.89 |
| Decision Tree (Post-Pruning F1) | 0.99 | 0.93 | 0.97 | 0.95 |
Insights 💡
Analyzing both training and test performance, we observe that the default and balanced Decision Tree models overfit, achieving perfect scores on the training set but showing a slight performance drop on the test set. This suggests a need for regularization via pruning.
- Pre-Pruning (F1 & Recall) → reduced model complexity, at the cost of slightly lower recall on both the training and test sets.
- Pre-Pruning (F1 Weighted then Balanced) → boosted training recall to 1.00 but compromised precision (0.86 train, 0.83 test), leaving the model imbalanced.
- Pre-Pruning (F1 Weighted & Balanced) → holds a good balance, keeping test recall at 0.92 with precision at 0.91.
Recommendation: The Post-Pruning F1 model offers the best tradeoff, generalizing well to unseen data while keeping recall (0.93), precision (0.97), and F1 (0.95) high on the test set. This makes it the most reliable choice for deployment. 🚀
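If this model is adopted, it helps to persist the fitted estimator so a scoring job can reuse it. A minimal sketch, assuming dt_postprunned_f1 is the fitted estimator from above (the filename is illustrative, not from this notebook):

import joblib

# Serialize the recommended model to disk (example path)
joblib.dump(dt_postprunned_f1, "dt_postpruned_f1.joblib")

# Later, e.g. in a batch scoring job, reload and predict on new customers
loaded_model = joblib.load("dt_postpruned_f1.joblib")
# loaded_model.predict(X_new)  # X_new must carry the same columns as X_train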
Decision Rule
print_tree_rules(dt_postprunned_f1, X_train.columns)
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2892.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [36.00, 0.00] class: 0
|   |   |   |   |--- Family > 3.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [34.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- weights: [85.00, 6.00] class: 0
|   |   |   |--- Income > 82.50
|   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |--- Experience <= 33.50
|   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |--- weights: [2.00, 3.00] class: 1
|   |   |   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |   |   |--- weights: [72.00, 6.00] class: 0
|   |   |   |   |   |--- Experience > 33.50
|   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |--- weights: [6.00, 2.00] class: 0
|   |   |   |   |   |   |--- Education > 1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Family > 2.50
|   |   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- Income <= 89.00
|   |   |   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- Income > 89.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |   |--- Age > 57.00
|   |   |   |   |   |   |--- weights: [9.00, 2.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [6.00, 14.00] class: 1
|--- Income > 113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [463.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 59.00] class: 1
|   |--- Education > 1.50
|   |   |--- Income <= 116.50
|   |   |   |--- Experience <= 32.00
|   |   |   |   |--- CCAvg <= 2.80
|   |   |   |   |   |--- Experience <= 18.00
|   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |--- Experience > 18.00
|   |   |   |   |   |   |--- weights: [1.00, 4.00] class: 1
|   |   |   |   |--- CCAvg > 2.80
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Experience > 32.00
|   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 237.00] class: 1
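The two largest pure class-1 leaves above read directly as targeting rules. As an illustrative sketch (my addition, not from the notebook; likely_loan_taker is a hypothetical helper), they translate to:

def likely_loan_taker(row):
    """Flag customers falling into the two largest pure class-1 leaves
    of the post-pruned tree printed above."""
    # Leaf: Income > 116.50 and Education > 1.50 -> weights [0, 237]
    if row["Income"] > 116.5 and row["Education"] > 1.5:
        return True
    # Leaf: Income > 113.50, Education <= 1.50, Family > 2.50 -> weights [0, 59]
    if row["Income"] > 113.5 and row["Education"] <= 1.5 and row["Family"] > 2.5:
        return True
    return False

# Example usage on the training features:
# candidates = X_train[X_train.apply(likely_loan_taker, axis=1)]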
Feature Importance
plot_feature_importance(importances_postprunned_f1)
importances_postprunned_f1
| | feature | importance |
|---|---|---|
| 5 | Education | 0.40 |
| 2 | Income | 0.32 |
| 3 | Family | 0.19 |
| 4 | CCAvg | 0.04 |
| 1 | Experience | 0.02 |
| 8 | CD_Account | 0.02 |
| 0 | Age | 0.01 |
| 9 | Online | 0.01 |
| 6 | Mortgage | 0.00 |
| 7 | Securities_Account | 0.00 |
| 10 | CreditCard | 0.00 |
Insights 📌
Education (0.40), Income (0.32), and Family (0.19) together account for over 90% of the split importance in the post-pruned tree, while Mortgage, Securities_Account, and CreditCard contribute nothing.
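For reference, a table like importances_postprunned_f1 can be rebuilt from sklearn's feature_importances_ attribute. A minimal sketch (the notebook's actual helper may differ):

# Gini-based importances of the fitted post-pruned tree, sorted descending
importances_sketch = (
    pd.DataFrame({
        "feature": X_train.columns,
        "importance": dt_postprunned_f1.feature_importances_,
    })
    .sort_values("importance", ascending=False)
)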
NOTE: Trying one more post-pruned model below, tuned for recall (out of curiosity)
dt_postprunned_recall = perform_ccp_pruning(
X_train,
X_test,
y_train,
y_test,
scoring_func=recall_score
)
Best alpha: 0.0
Number of nodes in pruned tree: 125
Tree depth: 13
Classification Report:
precision recall f1-score support
0 0.99 0.99 0.99 895
1 0.95 0.93 0.94 105
accuracy 0.99 1000
macro avg 0.97 0.96 0.97 1000
weighted avg 0.99 0.99 0.99 1000
y_train_pred = dt_postprunned_recall.predict(X_train)
y_test_pred = dt_postprunned_recall.predict(X_test)
dt_postprunned_recall_train_metrics = get_classification_metrics(y_train, y_train_pred)
dt_postprunned_recall_train_metrics
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
dt_postprunned_recall_test_metrics = get_classification_metrics(y_test, y_test_pred)
dt_postprunned_recall_test_metrics
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99 | 0.93 | 0.95 | 0.94 |
Even with recall as the pruning objective, the model remains overfit: the best alpha found is 0.0, i.e., no pruning at all, so the tree still scores perfectly on the training set while test performance matches the default tree. Cost-complexity pruning driven purely by recall provides no regularization here; the dataset may simply be too small for the model to capture this signal more robustly.
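perform_ccp_pruning is defined earlier in the notebook; below is a hedged sketch of the kind of search it likely performs, using sklearn's cost-complexity API (the CV-based selection and random_state=1 are illustrative assumptions, not the helper's exact logic):

# Enumerate the effective alphas of the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)

# Cross-validate one tree per alpha, scoring on recall
scores = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    scores.append(cross_val_score(clf, X_train, y_train, cv=5, scoring="recall").mean())

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
# When recall alone drives the choice, the unpruned tree (alpha = 0.0) can win,
# which is consistent with the "Best alpha: 0.0" output above.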
Key Takeaways & Recommendations for Personal Loan Campaign 🎯
Customer Profile for Targeting
- Education Level is Critical
- Income is a Strong Indicator
- Family Size Matters
- Credit Card Spending
- What Not to Focus On

Campaign Recommendations
- Credit Card Market Opportunity
- Primary Target Segment
- Marketing Message
- Channel Strategy
- Risk Awareness

Finance, Loan Behavior & Credit Card
- High Mortgage & Credit Spending Patterns
- Medium Mortgage Impact
- Debt Comfort Levels

Additional Recommendations
- Product Bundling Strategy
- Cross-Selling Opportunities
- Marketing Message Refinement
- Success Metrics
Suggestion: If possible, also collect more customer data; a larger sample should give the model a better chance to generalize.
This data-driven approach should help focus resources on the most promising customer segments while maximizing campaign effectiveness. 🎯
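To operationalize this, the recommended model can rank the customer base by predicted loan probability so the campaign contacts the most promising segment first. A minimal sketch (my addition; X_test stands in here for the scored customer base):

# Probability of the positive class (personal loan uptake) per customer
proba = dt_postprunned_f1.predict_proba(X_test)[:, 1]

# Rank customers and keep the top decile for outreach
ranked = (
    X_test.assign(loan_probability=proba)
    .sort_values("loan_probability", ascending=False)
)
top_decile = ranked.head(len(ranked) // 10)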